



A STEP FORWARD IN THE METHODOLOGY OF NATURAL
SCIENCE (AN INTRODUCTION TO: THE FUNCTIONAL
RELATION OF ONE VARIABLE TO EACH OF A
NUMBER OF CORRELATED VARIABLES DETERMINED
BY A METHOD OF SUCCESSIVE APPROXIMATION
TO GROUP AVERAGES. BY GEORGE F. McEWEN
AND ELLIS L. MICHAEL).

By Wm. E. Ritter.
Received April 17, 1919. Presented May 14, 1919. 

Modern biology, as the phrase is generally understood, is a development
of laboratory and experimental methods. This development
has been unprecedentedly rapid and rich. No one can deny this. 
But neither can any one fail to see, if he faces the situation squarely, 
that such development has rather narrow and wholly insurmountable 
limitations. 

Only a relatively small part of all the phenomena of living nature 
can be brought into the confines of the laboratory or by any means 
whatever subjected to control. The sciences of organic nature, 
botany and zoology, are in like case with those of inorganic nature, 
astronomy, geology, physical geography, meteorology, etc., as re- 
gards controllability. This is only an illustrative way of expressing 
the general truth that every science is able to study the phenomena 
of its province to only a relatively limited extent in the laboratory, 
or by experiment in the manipulative sense. As to the overwhelm- 
ingly vaster part of nature, those who would investigate it in very 
fact must go where it is, as far as this is possible, and where this is 
impossible must reach it by such indirect means as may be devised. 
This is so patent a truth as hardly to need mention: if the astronomer 
would investigate the stars of the southern sky he must go to the south- 
ern hemisphere; that is, must go where he can see those stars, and 
must study them as he finds them, not as he might wish to by manipu- 
lating them in a laboratory or on an experimental plot of the heavens. 
Similar conditions and limitations are imposed upon the biologist. 
If he is to study the starfishes of the southern hemisphere, he must 
go to oceans in that part of the earth and resort to such means as he
can to find and examine the creatures where nature has placed them.
True, the biologist has one great advantage over the astronomer: 
not only can he actually get starfishes into his hands, but he can take 
many of them — or the cadavers of them — home with him. In a 
word, the biologist has the advantage of being able to study in his 
laboratory, and by experimentation, the bodies themselves which are 
the subject-matter of his science. 

These general reflections lead to a still more general reflection on 
the character of the various sciences, which may be introduced by 
the question: How is it that physics and chemistry are so largely 
sciences of the laboratory and of experiment as to make them always 
stand as types of the experimental sciences? The reply is that these 
sciences are not natural sciences in the full sense; that is, in the sense 
of dealing, each in itself, with a delimited province of nature. They 
are sciences which concern themselves with certain elements and 
attributes of all nature, but not exhaustively with any portion of 
nature. Especially they do not deal with forms and changes in the 
time series which all natural bodies undergo. They are not natural 
history sciences. Gravitation is one and the same to the physicist 
whether manifested by a human body or an iceberg. Light is light, 
so far as fundamentals go, whether its source be a lighthouse, a fire- 
fly, or a sun. Similarly, it is all the same to the chemist whether his 
sample of potassium, provided it is pure, is extracted from a kelp 
plant or a crystal of feldspar. 

On the other hand, the several natural history sciences aim to deal 
exhaustively with all the phenomena presented in their respective 
domains of nature. 

These remarks appertain to such common-places in modern science 
that the making of them would not be justifiable but for certain impli- 
cations they bear, which have not received due recognition in the 
methodology of natural knowledge. The one of these implications
which concerns us at present may be expressed thus: The peculiarities
of the two groups of science, as indicated, namely, the group characterized
by dealing with definitive portions of nature only, and that
characterized by dealing with particular attributes only of all bodies,
bring it to pass that the two groups supplement and depend upon
each other in a more fundamental way than has been fully recognized 
by the methods actually used in either group. That the natural 
history sciences can reach full rounding-out only by supplementing 
their own particular discoveries and methods by those of physics and 
chemistry has received more recognition than has the fact that physics
and chemistry must use the discoveries and methods of the natural
history sciences in order to round themselves out. But physics, under 
such general designations as geo-physics and celestial physics, seems 
now to be moving rapidly toward a clear perception of its proper re- 
lation to, and dependence upon, the natural history sciences. The 
notable recent achievements in terrestrial magnetism, geodesy, meteo- 
rology, oceanography, and stellar distribution and growth, may be 
specially noted in illustration of this movement. 

Although chemistry is considerably behind physics in discovering 
its interdependence with natural history, astronomical spectroscopy, 
taxonomic bio-chemistry, and especially the discoveries in radioactivity,
in so far as these are revealing the evolutional changes and
phyletic affinities of chemical substances, are highly suggestive as
to what the future may have in store for chemistry. It seems that
chemistry has reached a stage in which it recognizes itself as no longer
justified in assuming, on the basis of any evidence it possesses, "that
the elements of to-day were eons ago the same substances and preserved
their properties unaltered."¹ This by itself is an important
step toward converting chemistry into a genuinely historical science. 

Now comes the kernel of this communication. The drawing into 
more vital mutual dependence of the two groups of science, the exact 
sciences, formerly so-characterized, and the natural or descriptive 
sciences, formerly so-called,² might be expected to enrich both groups.
For that is exactly what all natural drawing together does. And
expectation is being realized. The paper on method here presented
has grown out of the joint labors of a mathematical physicist working 
at oceanography as a branch of geo-physics, and a systematic zoologist 
working at the distribution of animals as an aspect of the broader 
problem of the relation of organisms to their natural environments, 
the two investigators having been brought together in the enterprise 
of gaining as much knowledge as possible of the pelagic life of a 
particular, restricted area of the Pacific Ocean. 

Specifically the problem is: Given the requisite taxonomic knowledge
of a natural group (an order, say, with its several genera and
species) of pelagic animals, and the requisite facts as to the vertical
distribution of these animals through diurnal and annual cycles;
and given further the requisite factual knowledge of the temperature,
salinity, light, etc., of the waters inhabited by the animals, how do
the observed changes of the several environmental elements operate
as causal factors in the distribution of the animals, this operation
being inferred from such correlations as may be discovered in the two
series of quantities, biologic and oceanographic?

1 "Old Age" of Chemical Elements, by Ingo W. D. Hackh, Science, April 4,
1919, p. 328.

2 As though a flock of seven wild geese were not physical and the counting
of them were not exact; and as though a discharge of electricity between
clouds were not natural and could not be, or did not need to be, described.

Obviously, the immediate problem is one of applied statistics; that 
is, of dealing with long numerical series of natural phenomena, which 
phenomena have been measured. 

Obviously, too, the method is one of dealing with phenomena as 
they occur in nature, as contrasted with the treatment of phenomena 
which may occur in a laboratory, or under conditions of manual experi- 
mentation. Particular attention is called to the fact that this last 
statement is equivalent to saying that the method is primarily induc- 
tive rather than deductive. And attention is called to the further 
facts that the case is illustrative of the very wide truth that so far 
as concerns the interpretation of actual nature, both animate and 
inanimate, laboratory and experimental methods are necessarily 
deductive for the most part; and that such interpretation can be 
made inductively only by carrying research into the "field" and put- 
ting quantitative determinations on a statistical basis. 

Incidentally it may be pointed out that should the method prove 
practicable and trustworthy, it would be highly useful since there 
is a wide range of similar problems, many of them exceedingly important.
So far as the principle is concerned, its applicability would be
to the entire expanse of living nature, because all organisms, man 
with the rest, are subject to natural environments of some sort, and 
the very essence of the method is its effort to bring together data 
pertaining to organisms and their environments thus taken. 

Indeed, the method is applicable to very many phenomena of nature 
outside the organic realm, as to those of the atmosphere, of the land 
masses, and of the waters of the earth. 



THE FUNCTIONAL RELATION OF ONE VARIABLE TO
EACH OF A NUMBER OF CORRELATED VARIABLES
DETERMINED BY A METHOD OF SUCCESSIVE
APPROXIMATION TO GROUP AVERAGES:
A CONTRIBUTION TO STATISTICAL
METHODS.¹

By George F. McEwen and Ellis L. Michael.

CONTENTS.

                                                                 Page
1. Remarks on methods of acquiring knowledge                       95
2. General statement of problem and mode of attack                 97
3. Mathematical demonstration:
   A. The case when variability within the group is neglected     100
   B. The case when variability within the group is taken into
      account                                                     105
4. Illustration of method by solution of a particular problem
   concerning the relation between temperature, precipitation,
   and yield of wheat in South Dakota                             113
5. Supplementary considerations                                   127
6. Literature cited                                               133

1. Remarks on Methods of Acquiring Knowledge. 

One method of acquiring knowledge, the deductive, is to formulate 
fundamental concepts and principles that are simple but comprehen- 
sive, and to attempt to deduce therefrom the sequences and other 
relationships observed in nature. This implies that nature conforms 
to a logical system and, consequently, discovery of the few basic 
elements of the system, together with suitable logical treatment, 
furnishes a description of the observed phenomena. Considered 
quantitatively, this method of acquiring knowledge is the classical one 
of applied mathematics. 

1 Presented in abstract before the San Francisco Section of The American
Mathematical Society, April 7, 1917.

Another method is the "inductive" or, better, the empirical method.
One becomes directly aware of innumerable facts by means of sense
perceptions. Many are acquired without particular effort and seem
trivial, while special attention is directed toward acquisition of others
of more apparent significance. Closely associated with this process
of observation is that of description and classification, which makes
possible a comparison with the results obtained by other observers.
Moreover, classification is the first step toward the important object 
of determining the uniformities and other relationships exhibited by 
the mass of facts in question. From definitions of the individual 
facts, and their classification into groups we pass, by induction, to a 
description of the observed system as a whole, i. e., to an empirical 
law. Though the observed system is only a fragmentary sample of
the "universe" it represents, the empirical laws are just as real, 
insofar as they describe that sample, as are the individual facts. 
The gain in simplicity and conciseness is made, of course, at the 
expense of detail, as is true of any summary. Further, experience 
shows that, as a rule, there is not in nature a one-to-one correspondence 
between observations of one kind and those of another kind. A 
plurality of causes, influences, factors, or whatever one may call 
them, must, in general, be considered, and their mutual relations 
taken into account. Considered quantitatively, this method of 
acquiring knowledge is statistical mathematics,¹ the ideal of which is
attained when the empirical laws and assignment of a value to each
kind of quantity but one serve to determine the latter.

In former times, when the fundamentals inherent in the deductive 
method were largely the result of introspection, attention of scholars 
was directed mainly to these fundamentals and their logical conse- 
quences. Initial concepts and deductions from them were regarded 
as the realities of nature, while direct evidence of the senses was dis- 
credited. In spite of the downfall of scholasticism four centuries ago 
there still is a strong tendency to regard observations as secondary 
in importance, and even of no importance when they have no appar- 
ent bearing upon some dominant theory or fail to fit in with a prevail- 
ing practice. Such a tendency does not inhere in the empirical 
method, which not only yields results as free as possible from personal 
bias and preconceived opinion, but, when conscientiously applied, 
affords the only basis for certain knowledge concerning any objective 
phenomenon. 

1 The phrase "mathematical statistics" might have been better, since
it is in common use, were it not the prevailing practice, especially among
biologists, to base the mathematical reasoning, more or less unconsciously,
upon a preconception to which the statistical data treated do not necessarily,
or even usually, conform, e. g., the Gaussian law of error. Such a practice
is essentially an application of the deductive method to statistical problems;
statistical mathematics, on the other hand, is strictly empirical since the
mathematical logic is based solely upon the statistical data at hand.

After the objective phenomena constituting the subject-matter of
any particular scientific inquiry have been observed and described
with thoroughness and classified into empirical laws and generaliza- 
tions, it is proper and desirable to employ the deductive method and 
ascertain the extent to which that particular natural system conforms 
to a logical one. But, it not infrequently happens that the specialist 
extends theories that have proved useful in his restricted researches 
to classes of phenomena with which he is unfamiliar without first 
making a critical examination as to their applicability. This tendency 
is manifested especially by the isolated or individual investigator 
whose attention is necessarily restricted to the limited observations 
he can make and to relevant data others may have gathered. When 
confronted with a problem calling for extensive observation in the 
field he is prone to carry it into the laboratory in an attempt, on the 
basis of theory and carefully conducted experiments, to reach an
explanation of phenomena never observed. The deductive method 
thus becomes a process of inventing facts to fit theories, instead of 
theories to fit facts. The empirical method, on the other hand, being 
concerned primarily with direct observation, often demands that 
investigation be carried on by an organized group of individuals work- 
ing in cooperation for the purpose of obtaining as wide a range of 
relevant observations as possible. Each method has its place in all 
investigations, and each has its limitations; and it is only by com- 
bining all relevant observations with induction and deduction that 
one can use his full powers of cognition and approach complete solu- 
tion of any problem in natural science. 



2. General Statement of Problem and Mode of Attack.

How can the values of a variable, for example the yield of wheat 
per acre of a given region, measured at equal intervals of time, say 
annually, be used for predicting the yield for the ensuing year? The 
frequency with which the wheat yield has been observed to fall within 
given limits, divided by the total number of observations, is the em- 
pirical probability that the next yield of wheat will fall within those 
limits. Obviously, the more frequently the wheat yield has been 
observed to fall within given limits the larger will be this empirical 
probability, and the greater the total number of observations the 
nearer will this empirical probability approximate the true probability,
i. e., the frequency ratio that would have resulted had the
number of observations been infinite. This is one way of making
predictions. 
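
The frequency rule just stated can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `empirical_probability` is hypothetical, and the yields are made-up numbers, not data from this paper.

```python
from typing import Sequence


def empirical_probability(values: Sequence[float], low: float, high: float) -> float:
    """Fraction of past observations that fell within the limits [low, high]."""
    if not values:
        raise ValueError("at least one observation is required")
    hits = sum(1 for v in values if low <= v <= high)
    return hits / len(values)


# Hypothetical annual wheat yields (bushels per acre); illustrative only.
yields = [9.0, 12.5, 14.0, 11.0, 13.5, 8.0, 12.0, 15.5, 10.5, 13.0]

# Seven of the ten yields lie between 10 and 14, so the empirical
# probability that the next yield falls within those limits is 7/10.
p = empirical_probability(yields, 10.0, 14.0)
```

With more observations the ratio would be expected to settle toward the true probability described above; the sketch makes no claim about how fast.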

Again, suppose observations show that the yield of wheat either 
increases or decreases on the average with respect to time, or that, 
within some definite period, there is a cycle or typical variation that 
is repeated in approximately the same manner in each period. In such 
cases information regarding the general trend or cycles affords a more 
satisfactory basis for prediction, i. e., it results in prediction within 
smaller limits for a given probability than is possible by the simpler 
frequency method. But, suppose some other phenomenon is also 
measured, for example, the rainfall during a given season of each year. 
If it be observed that, in general, a large rainfall is followed by a 
large yield of wheat, knowledge of the former could also be used to 
improve prediction of the latter. Still another improvement would 
be expected if the temperature during the growing season were also 
measured, and so on. 

Prediction, as thus illustrated, implies a lag of the quantity pre- 
dicted (dependent variable) behind the remaining quantities (inde- 
pendent variables). Considering the case in general this lag may be 
of any magnitude between any of the selected independent variables 
and the dependent one, or all may vary simultaneously. But, the 
problem of determining the empirical relations is the same, and, as 
more factors are measured and as the number of observations increases, 
approximation is had to the ideal of precise determination. However, 
under the most favorable conditions, some deviations between 
observed and computed values always remain, and these are called 
accidental or "chance" variations. Even in laboratory experiments, 
where the idea of artificial control over the independent variables is 
dominant, it is often necessary, in order to obtain best results, to cor- 
rect the dependent variable for unavoidable fluctuations due to vari- 
ables beyond control. In any natural problem, however, the factors 
involved are all necessarily variable and, as a rule, mutually corre- 
lated, so that, in any given case, one is confronted with the difficulty 
of selecting the most important factors, and the necessity of determin- 
ing the approximate functional relation of the dependent variable to 
each of the mutually correlated independent ones. 

If the functional relation between the variables is known to be 
approximately linear, or can be made so by introducing suitable func- 
tions, the usual method of multiple correlation may be applied. 
Again, in case the form of the functions expressing the relation of the 
dependent to each independent variable is known, the method of
least squares or the method of moments may be used to determine
the numerical values of the constants appearing in the mathematical 
expressions. But in many, if not most, cases in practice the forms 
of the functions are quite unknown and must be determined solely 
from the data at hand. 
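
For contrast, the following sketch shows what the least-squares route presupposes: the form of the function (here a straight line, w = a + b·x) must be chosen before any constant can be computed. The function name `fit_line` and the data are hypothetical, chosen only to illustrate the point.

```python
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], w: Sequence[float]) -> Tuple[float, float]:
    """Least-squares constants (a, b) for the ASSUMED linear form w = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_w = sum(w) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxw = sum((xi - mean_x) * (wi - mean_w) for xi, wi in zip(x, w))
    b = sxw / sxx
    a = mean_w - b * mean_x
    return a, b


# Hypothetical rainfall (inches) and yields that follow a line exactly.
rain = [10.0, 12.0, 14.0, 16.0, 18.0]
wheat = [2.0 + 0.5 * r for r in rain]
a, b = fit_line(rain, wheat)  # recovers a = 2.0, b = 0.5 up to rounding
```

If the true relation were not linear, the same computation would still return two constants, which is precisely the difficulty the paragraph above describes.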

The object of this investigation is to devise a general method of 
obtaining the relation between a dependent variable and each of the 
mutually correlated independent ones without being compelled to 
employ an assumed or predetermined mathematical function. This is 
accomplished by applying to the observed values of the dependent 
variable successive corrections based upon each value of all the independent
variables. In this way is obtained a series of averages of
the dependent variable corresponding to a series of averages of each one 
of the independent variables in turn, and corrected to a constant 
value of each of the remaining ones. Perhaps this will be more intel- 
ligible if stated in the concrete terms of the wheat problem. In this 
particular case the method is that of obtaining a series of averages 
of the wheat yield, corrected to a constant rainfall, corresponding to a 
series of temperature averages; and a similar series of averages of the
wheat yield, corrected to a constant temperature, corresponding to a 
series of rainfall averages. The averages thus obtained define, approxi- 
mately, the functional relation desired. 
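
The starting point of the method, a series of uncorrected group averages of the dependent variable against one independent variable, might be computed as below. The function name `group_averages`, the near-equal group sizes, and the numbers are assumptions made for illustration; the corrections to constant values of the other variables are not included here.

```python
from typing import List, Sequence, Tuple


def group_averages(
    x: Sequence[float], w: Sequence[float], n_groups: int
) -> List[Tuple[float, float, int]]:
    """Sort the (x, w) pairs by x, cut them into n_groups consecutive groups,
    and return (average x, average w, number of entries) for each group."""
    pairs = sorted(zip(x, w))
    n = len(pairs)
    if n_groups < 1 or n_groups > n:
        raise ValueError("need between 1 and len(x) groups")
    result = []
    for g in range(n_groups):
        chunk = pairs[g * n // n_groups : (g + 1) * n // n_groups]
        mean_x = sum(p[0] for p in chunk) / len(chunk)
        mean_w = sum(p[1] for p in chunk) / len(chunk)
        result.append((mean_x, mean_w, len(chunk)))
    return result


# Hypothetical rainfall and wheat yield, deliberately out of order.
rainfall = [3.0, 1.0, 2.0, 6.0, 5.0, 4.0]
wheat = [30.0, 10.0, 20.0, 60.0, 50.0, 40.0]
series = group_averages(rainfall, wheat, 3)
# series == [(1.5, 15.0, 2), (3.5, 35.0, 2), (5.5, 55.0, 2)]
```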

The idea of defining a function by means of a series of correspond- 
ing values of dependent and independent variables is utilized in cer- 
tain problems of higher mathematics (Fredholm, 1900; 1903; Bocher, 
1909). But, in pure mathematics, it is possible to pass to the limit 
and obtain an infinite series of pairs of corresponding values, which 
defines the functional relation uniquely. In objective science this 
is impossible, and, although various well-known methods of inter- 
polation are available for approximating thereto, one is between 
the two horns of a dilemma. It is obvious that the effect of acci- 
dental variations is reduced to a minimum for any given number of 
observations when the number entering into each average is a maxi- 
mum, but this also reduces to a minimum the number of averages 
upon which definition of the functional relation depends. Stated 
otherwise, the greater the number of averages for a given number of 
observations the more precisely will the functional relation be deter- 
mined; but, owing to the larger effect of accidental variations, the 
less reliable will be the result. One must therefore use his judgment 
in classifying the data, and should test the reliability of the results. 
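
The dilemma can be made concrete with a small sketch. It rests on an assumption not made in the text: that the accidental variations are independent with a common standard deviation `sigma`, so that the standard error of an average of m entries is sigma divided by the square root of m. The function name `tradeoff` and the figures are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def tradeoff(
    n_obs: int, sigma: float, group_counts: Sequence[int]
) -> List[Tuple[int, int, float]]:
    """For each number of groups k, return (k, entries per group,
    standard error of a group average) under the independence assumption."""
    rows = []
    for k in group_counts:
        per_group = n_obs // k
        rows.append((k, per_group, sigma / math.sqrt(per_group)))
    return rows


# 60 observations with an assumed accidental standard deviation of 2 units.
rows = tradeoff(60, 2.0, [2, 3, 6, 12])
for k, per_group, std_err in rows:
    print(f"{k:2d} groups, {per_group:2d} entries each, std. error {std_err:.3f}")
```

More groups give more points on the curve, but each point is less trustworthy; the choice between them remains a matter of judgment, as stated above.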

In closing this section, it may be of interest to mention how the
method happened to be devised. It had its origin in our collaboration
on problems concerning the quantitative relation between variations
in the number of certain marine organisms and fluctuations in
the elements of their environmental complexes. Attempts to elimi- 
nate the effects of correlation between the environmental elements 
by the method of multiple correlation and that of least squares, 
combined with various subsidiary expedients, proved highly unsatis- 
factory. The reason is that, at the outset of the mathematical reason- 
ing, assumptions of either a linear or some other definite type of 
regression, or of the functional form of the observation equations 
must be introduced for which no justification is afforded by the data 
themselves. After a fairly exhaustive study of the literature, which 
failed to provide any practicable and rigorous way of handling such 
problems, we were led to devise one which culminated, in part, in 
this method of successive approximation to group averages. Although 
the central idea is the product of our collaboration, the mathematical 
demonstration and the practical process of making the computations 
are primarily due to the senior author. Furthermore, the particular 
problems whose study led to developing this method are too complex 
to afford suitable means of illustration. For this reason the simpler 
problem of the relation between temperature, precipitation, and yield 
of wheat in South Dakota is used, a study of which, by means of 
multiple linear correlation, has been published by Blair (1918). 

The mathematical demonstration, while close and rigorous, is 
neither abstruse nor difficult. In the case when variability within 
the group is neglected (section 3A) it involves nothing beyond the elements
of algebra. But, in the case when this variability is taken
into account (section 3B) the demonstration also presupposes knowl- 
edge of linear regression, so that some readers may prefer to follow 
through the concrete process of computation given in section 4 illus- 
trating the first case, before turning attention to the analytic demon- 
stration in the second case. 



3. Mathematical Demonstration: 

A. The case when variability within the group is neglected.

When the change in the dependent variable, w, corresponding to a
given change in one independent variable, say x, is negligibly influenced
by the magnitude of the constant values to which the remaining
independent variables y, z, etc., are reduced (see p. 128), the expression
for w takes the form

$$w = f_1(x) + f_2(y) + f_3(z) + \cdots \tag{1}$$

where $f_1$, $f_2$, $f_3$, etc., denote the unknown functional relations of w to x,
y, z, etc. The problem is to determine each of these unknown functions
from the numerical data.

In laboratory experiments it is usually possible so to control the
independent variables as to hold all but one, say x, at constant values,
y, z, etc. In such instances the difference between any two values
of the series $f_1(x_1)$, $f_1(x_2)$, $f_1(x_3)$, etc., can be readily found, where
$x_1$, $x_2$, $x_3$, etc., are averages of x in each of a series of groups formed in
succession from the values of x arranged in ascending order of magnitude.
The purpose of taking averages is to eliminate so far as possible
effects of accidental variations due to variables beyond control. The
corresponding averages, $w_1$, $w_2$, $w_3$, etc., of the observed values of w,
therefore are

$$\begin{aligned}
w_1 &= f_1(x_1) + M \\
w_2 &= f_1(x_2) + M \\
w_3 &= f_1(x_3) + M
\end{aligned} \tag{2}$$

where

$$M = f_2(y) + f_3(z) + \cdots \tag{3}$$

is a constant since y, z, etc., are constant. Similarly, the relation of
w to y, w to z, etc., may be thus determined.
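
The step from equations (2) to the differences of $f_1$ may be written out explicitly. Subtracting any two of equations (2), the constant M of equation (3) cancels, so that differences of the group averages of w give differences of $f_1$ directly:

```latex
\begin{aligned}
w_2 - w_1 &= \bigl[f_1(x_2) + M\bigr] - \bigl[f_1(x_1) + M\bigr] = f_1(x_2) - f_1(x_1), \\
w_3 - w_2 &= \bigl[f_1(x_3) + M\bigr] - \bigl[f_1(x_2) + M\bigr] = f_1(x_3) - f_1(x_2).
\end{aligned}
```

The cancellation depends entirely on M being the same in every group, that is, on y, z, etc., being held constant.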

But it is only under the artificial conditions of the laboratory that 
this simple way of determining the unknown functions is valid; 
and, even so, there is no guarantee that the same functional relations 
will hold good under natural conditions. In nature one is limited to 
observing what is actually taking place; all influences are beyond 
control; all vary simultaneously; and all are more or less correlated. 
Differences between successive values of w (equations 2) are therefore 
not due, in general, to the fluctuation in x alone, but also to fluctua- 
tions in the remaining variables, y, z, etc. Moreover, the effect of 
these remaining variables often is large enough to produce serious 
errors in the relations indicated by this simple mode of procedure. 

Such errors must be eliminated. To accomplish this, say in the 
relation of w to x, corrections are computed for the purpose of reducing
each value of w in the (w, x) series as nearly as possible to the value
it would have had if y, z, etc., had constant values arbitrarily chosen. 
Regarding the average of the original values in each group of the (w, x) 
series as a first approximation, the second is obtained by applying to 
each value of w, corrections derived from the relation of w to y, w to z, 
etc., indicated by the original series of group averages,¹ (w, y), (w, z), etc.
A second approximation to the relation of w to y is then obtained by 
introducing corrections based upon the second approximation in the 
(w,x) series and first approximations in the remaining series. The 
process is thus continued until second approximations are obtained 
to the relation of w to each of the remaining independent variables. 
By means of these second approximations in the (w, y), (w, z), etc., 
series a third approximation to the functional relation of w to x is 
obtained, and so on. It seems reasonable that such successive approxi- 
mations would result in convergence to values of w corresponding 
to a variation of only one independent variable at a time. This is 
confirmed by the following analytical demonstration, which also 
yields a practicable method of making and checking the computations. 
For clearness the analytical demonstration is given for the special
case of three independent variables and three groups of each, but the
same reasoning applies to the general case of any number of variables
and groups. Arrange the values of the independent variable x in
ascending order, segregate them into three groups, and let $x_1$, $x_2$, and
$x_3$ be the average of x in the three groups respectively. Let $A^1$,
$B^1$, and $C^1$ be the corresponding original averages of the dependent
variable w, and A, B, and C be the required values of these averages
corresponding to constant values of the two independent variables,
y and z. Denote the number of entries per group by $N_1$, $N_2$, and $N_3$.
This notation, together with that for the y and z variables, is presented
in tabular form as follows:



1 All values of the independent variable, y, in any one y-group are thus
assumed to equal the average within that group, and similarly for the remaining
independent variables, z, etc. To state it otherwise, in correcting for the
effects of any variable, y, each value of w in each y-group is assumed to correspond
to the average value of y within that group (see p. 105).



TABLE I.
General Notation Employed.

$A^1$ $B^1$ $C^1$  |  $D^1$ $E^1$ $F^1$  |  $G^1$ $H^1$ $I^1$  |  Original group averages of dependent variable, w.
$A$ $B$ $C$        |  $D$ $E$ $F$        |  $G$ $H$ $I$        |  Required group averages of dependent variable, w.
$x_1$ $x_2$ $x_3$  |  $y_4$ $y_5$ $y_6$  |  $z_7$ $z_8$ $z_9$  |  Group averages of the values of the independent variable arranged in ascending order.
$N_1$ $N_2$ $N_3$  |  $N_4$ $N_5$ $N_6$  |  $N_7$ $N_8$ $N_9$  |  Number of entries per group.



In determining the relation of $w$ to $x$, let all values of $w$ be corrected
for the variable $y$ to its middle value $y_5$, and for the variable $z$ to its
middle value $z_8$. Likewise, in determining the relation of $w$ to $y$,
let all values of $w$ be corrected for $x$ to its middle value $x_2$, and for $z$
to its middle value $z_8$. Similarly, in determining the relation of
$w$ to $z$, let all values of $w$ be corrected for $x$ to $x_2$ and for $y$ to $y_5$.
Correction to these arbitrarily chosen "standard" values requires the
following differences between the group averages:



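$$ B^i - A^i = a^i, \quad B^i - C^i = c^i; \qquad B - A = a, \quad B - C = c \tag{4} $$

$$ E^i - D^i = d^i, \quad E^i - F^i = f^i; \qquad E - D = d, \quad E - F = f \tag{5} $$

$$ H^i - G^i = g^i, \quad H^i - I^i = i^i; \qquad H - G = g, \quad H - I = i \tag{6} $$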



Denote by $A^i$ any observed value of $w$ in the $(w, x_1)$ group; by $B^i$
any observed value of $w$ in the $(w, x_2)$ group, and so on to $I^i$ for any
observed value of $w$ in the $(w, z_9)$ group. Then one particular value of
$A^i$ is identical with some particular value of $w$ in the $(w, y)$ series and
also in the $(w, z)$ series, i. e., it is identical with a particular value of
$D^i$, $E^i$, or $F^i$, and also of $G^i$, $H^i$, or $I^i$. Suppose it is in the
$(w, x_1)$, $(w, y_4)$, and $(w, z_8)$ groups, which, for convenience, will be
referred to as groups 1, 4, and 8. The correction $E - D = d$ (see equations 5)
must be added to reduce this particular value of $w$ to that corresponding
to the standard value $y_5$ of $y$, but no correction need be applied for
$z$, since it is in group 8 which was selected as the standard. If each
value of $w$ in group 1 be corrected in this way, the mean of the corrected
values is, by definition, equal to the required value $A$. A
similar procedure with respect to all the other groups gives the remaining
required values $B$, $C$, $D$, $E$, $F$, $G$, $H$, and $I$. Evidently, this is
equivalent to adding to the average of the observed values $A^i$ the
average of all the corrections. That is



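$$ A = \frac{\sum A^i}{N_1} + \frac{1}{N_1}\left(n_{14} d + n_{16} f + n_{17} g + n_{19} i\right) = A^i + \frac{1}{N_1}\left(n_{14} d + n_{16} f + n_{17} g + n_{19} i\right) \tag{7} $$

where

$n_{14}$ = number of observations common to groups 1 and 4
$n_{16}$ = number of observations common to groups 1 and 6
$n_{17}$ = number of observations common to groups 1 and 7
$n_{19}$ = number of observations common to groups 1 and 9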

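In the same manner equations (8) to (15) are obtained.

$$ B = B^i + \frac{1}{N_2}\left(n_{24} d + n_{26} f + n_{27} g + n_{29} i\right) \tag{8} $$

$$ C = C^i + \frac{1}{N_3}\left(n_{34} d + n_{36} f + n_{37} g + n_{39} i\right) \tag{9} $$

$$ D = D^i + \frac{1}{N_4}\left(n_{41} a + n_{43} c + n_{47} g + n_{49} i\right) \tag{10} $$

$$ E = E^i + \frac{1}{N_5}\left(n_{51} a + n_{53} c + n_{57} g + n_{59} i\right) \tag{11} $$

$$ F = F^i + \frac{1}{N_6}\left(n_{61} a + n_{63} c + n_{67} g + n_{69} i\right) \tag{12} $$

$$ G = G^i + \frac{1}{N_7}\left(n_{71} a + n_{73} c + n_{74} d + n_{76} f\right) \tag{13} $$

$$ H = H^i + \frac{1}{N_8}\left(n_{81} a + n_{83} c + n_{84} d + n_{86} f\right) \tag{14} $$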



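TABLE 2.
Process of Solution by Successive Approximation.

Line   Substitute                                                                     In equations     And obtain

Second approximation determined directly
1      $d^i, f^i, g^i, i^i$                                                            7, 8, and 9      $\Delta A^{ii}, \Delta B^{ii}, \Delta C^{ii}$;  $A^{ii}, B^{ii}, C^{ii}$;  $a^{ii}, c^{ii}, \Delta^1 a^{ii}, \Delta^1 c^{ii}$
2      $a^{ii}, c^{ii}, g^i, i^i$                                                      10, 11, and 12   $\Delta D^{ii}, \Delta E^{ii}, \Delta F^{ii}$;  $D^{ii}, E^{ii}, F^{ii}$;  $d^{ii}, f^{ii}, \Delta^1 d^{ii}, \Delta^1 f^{ii}$
3      $a^{ii}, c^{ii}, d^{ii}, f^{ii}$                                                13, 14, and 15   $\Delta G^{ii}, \Delta H^{ii}, \Delta I^{ii}$;  $G^{ii}, H^{ii}, I^{ii}$;  $g^{ii}, i^{ii}, \Delta^1 g^{ii}, \Delta^1 i^{ii}$

Third approximation determined directly and checked by first differences $\Delta^1$
4      $d^{ii}, f^{ii}, g^{ii}, i^{ii}$                                                7, 8, and 9      $\Delta A^{iii}, \Delta B^{iii}, \Delta C^{iii}$;  $A^{iii}, B^{iii}, C^{iii}$;  $a^{iii}, c^{iii}, \Delta^1 a^{iii}, \Delta^1 c^{iii}$
5      $\Delta^1 d^{ii}, \Delta^1 f^{ii}, \Delta^1 g^{ii}, \Delta^1 i^{ii}$            7, 8, and 9      $\Delta^1 A^{iii}, \Delta^1 B^{iii}, \Delta^1 C^{iii}$;  $\Delta A^{iii}, \Delta B^{iii}, \Delta C^{iii}$
6      $a^{iii}, c^{iii}, g^{ii}, i^{ii}$                                              10, 11, and 12   $\Delta D^{iii}, \Delta E^{iii}, \Delta F^{iii}$;  $D^{iii}, E^{iii}, F^{iii}$;  $d^{iii}, f^{iii}, \Delta^1 d^{iii}, \Delta^1 f^{iii}$
7      $\Delta^1 a^{iii}, \Delta^1 c^{iii}, \Delta^1 g^{ii}, \Delta^1 i^{ii}$          10, 11, and 12   $\Delta^1 D^{iii}, \Delta^1 E^{iii}, \Delta^1 F^{iii}$;  $\Delta D^{iii}, \Delta E^{iii}, \Delta F^{iii}$
8      $a^{iii}, c^{iii}, d^{iii}, f^{iii}$                                            13, 14, and 15   $\Delta G^{iii}, \Delta H^{iii}, \Delta I^{iii}$;  $G^{iii}, H^{iii}, I^{iii}$;  $g^{iii}, i^{iii}, \Delta^1 g^{iii}, \Delta^1 i^{iii}$
9      $\Delta^1 a^{iii}, \Delta^1 c^{iii}, \Delta^1 d^{iii}, \Delta^1 f^{iii}$        13, 14, and 15   $\Delta^1 G^{iii}, \Delta^1 H^{iii}, \Delta^1 I^{iii}$;  $\Delta G^{iii}, \Delta H^{iii}, \Delta I^{iii}$

Fourth approximation determined by first differences and checked by second differences $\Delta^2$
10     $\Delta^1 d^{iii}, \Delta^1 f^{iii}, \Delta^1 g^{iii}, \Delta^1 i^{iii}$        7, 8, and 9      $\Delta^1 A^{iv}, \Delta^1 B^{iv}, \Delta^1 C^{iv}$;  $\Delta A^{iv}, \Delta B^{iv}, \Delta C^{iv}$;  $\Delta^1 a^{iv}, \Delta^1 c^{iv}$
11     $\Delta^2 d^{iii}, \Delta^2 f^{iii}, \Delta^2 g^{iii}, \Delta^2 i^{iii}$        7, 8, and 9      $\Delta^2 A^{iv}, \Delta^2 B^{iv}, \Delta^2 C^{iv}$;  $\Delta^1 A^{iv}, \Delta^1 B^{iv}, \Delta^1 C^{iv}$;  $\Delta^1 a^{iv}, \Delta^1 c^{iv}, \Delta^2 a^{iv}, \Delta^2 c^{iv}$
12     $\Delta^1 a^{iv}, \Delta^1 c^{iv}, \Delta^1 g^{iii}, \Delta^1 i^{iii}$          10, 11, and 12   $\Delta^1 D^{iv}, \Delta^1 E^{iv}, \Delta^1 F^{iv}$;  $\Delta D^{iv}, \Delta E^{iv}, \Delta F^{iv}$;  $\Delta^1 d^{iv}, \Delta^1 f^{iv}$
13     $\Delta^2 a^{iv}, \Delta^2 c^{iv}, \Delta^2 g^{iii}, \Delta^2 i^{iii}$          10, 11, and 12   $\Delta^2 D^{iv}, \Delta^2 E^{iv}, \Delta^2 F^{iv}$;  $\Delta^1 D^{iv}, \Delta^1 E^{iv}, \Delta^1 F^{iv}$;  $\Delta^1 d^{iv}, \Delta^1 f^{iv}, \Delta^2 d^{iv}, \Delta^2 f^{iv}$
14     $\Delta^1 a^{iv}, \Delta^1 c^{iv}, \Delta^1 d^{iv}, \Delta^1 f^{iv}$            13, 14, and 15   $\Delta^1 G^{iv}, \Delta^1 H^{iv}, \Delta^1 I^{iv}$;  $\Delta G^{iv}, \Delta H^{iv}, \Delta I^{iv}$;  $\Delta^1 g^{iv}, \Delta^1 i^{iv}$
15     $\Delta^2 a^{iv}, \Delta^2 c^{iv}, \Delta^2 d^{iv}, \Delta^2 f^{iv}$            13, 14, and 15   $\Delta^2 G^{iv}, \Delta^2 H^{iv}, \Delta^2 I^{iv}$;  $\Delta^1 G^{iv}, \Delta^1 H^{iv}, \Delta^1 I^{iv}$;  $\Delta^1 g^{iv}, \Delta^1 i^{iv}, \Delta^2 g^{iv}, \Delta^2 i^{iv}$

Supplementary explanations

$A^{ii} = A^i + \Delta A^{ii}$,  $A^{iii} = A^i + \Delta A^{iii}$,  $A^{iv} = A^i + \Delta A^{iv}$,  etc.

$a^{ii} = B^{ii} - A^{ii} = B^i - A^i + \Delta B^{ii} - \Delta A^{ii} = a^i + \Delta B^{ii} - \Delta A^{ii}$
$a^{iii} = a^i + \Delta B^{iii} - \Delta A^{iii}$
$a^{iv} = a^i + \Delta B^{iv} - \Delta A^{iv}$,  etc.

$d^{ii} - d^i = \Delta^1 d^{ii} = \Delta^1 E^{ii} - \Delta^1 D^{ii}$
$d^{iii} - d^{ii} = \Delta^1 d^{iii} = \Delta^1 E^{iii} - \Delta^1 D^{iii}$
$d^{iv} - d^{iii} = \Delta^1 d^{iv} = \Delta^1 E^{iv} - \Delta^1 D^{iv}$,  etc.

$\Delta^1 d^{iii} - \Delta^1 d^{ii} = \Delta^2 d^{iii} = \Delta^2 E^{iii} - \Delta^2 D^{iii}$
$\Delta^1 d^{iv} - \Delta^1 d^{iii} = \Delta^2 d^{iv} = \Delta^2 E^{iv} - \Delta^2 D^{iv}$,  etc.

$\Delta^1 A^{iii} = \Delta A^{iii} - \Delta A^{ii}$
$\Delta^1 A^{iv} = \Delta A^{iv} - \Delta A^{iii}$,  etc.

$\Delta^2 A^{iv} = \Delta^1 A^{iv} - \Delta^1 A^{iii}$
$\Delta^2 A^{v} = \Delta^1 A^{v} - \Delta^1 A^{iv}$,  etc.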



$$ I = I^i + \frac{1}{N_9}\left(n_{91} a + n_{93} c + n_{94} d + n_{96} f\right) \tag{15} $$

where $n_{41} = n_{14}$, $n_{43} = n_{34}$, etc.

These nine equations, together with equations (4), (5), and (6) 
defining the quantities a, c, d, f, g, and i, determine the nine unknowns, 
A, B, C, etc. They can be solved simultaneously, but as a rule, labor 
is saved by employing a method of successive approximation, the 
details of which are given in table 2. 

If the process of successive approximation be continued, as indicated
by lines 1, 2, 3, and 4, 6, 8, and results in convergence 1 to definite
limiting values, these values will evidently satisfy equations (4) to
(15). In the third approximation the procedure indicated by lines
5, 7, and 9, involving first differences, affords a numerical check on
the computation of $\Delta A^{iii}$, $\Delta B^{iii}$, etc., and $\Delta^1 a^{iii}$, $\Delta^1 c^{iii}$, etc., of lines
4, 6, and 8. It is possible to continue checking each computation in
this way until convergence is attained. But, beginning with the
fourth approximation a further saving of labor is effected by computing
first differences (lines 10, 12, and 14) and checking these results by
second differences (lines 11, 13, and 15), since all differences converge
to zero. For a numerical illustration see page 113.
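The process described above can be sketched in code. The following is only an illustrative sketch, not the authors' worksheet: the function name `successive_approximation`, its argument layout, and the stopping tolerance are assumptions introduced here. It applies equations of the form (7) to (15) repeatedly, always using the latest corrected averages of the other variables, and stops when no average changes appreciably; the checks by first and second differences of table 2 are not reproduced.

```python
def successive_approximation(means, sizes, counts, standard, tol=1e-9, max_iter=100):
    """Corrected group averages by repeated substitution (hypothetical helper).

    means[v][g]          original average of w in group g of variable v
    sizes[v][g]          number of entries in that group
    counts[(v, g, u, h)] entries common to group g of v and group h of u
    standard[u]          index of the "standard" group of variable u
    """
    corrected = {v: list(m) for v, m in means.items()}
    for passes in range(1, max_iter + 1):
        largest_change = 0.0
        for v in means:
            for g in range(len(means[v])):
                correction = 0.0
                for u in means:
                    if u == v:
                        continue
                    s = standard[u]
                    for h in range(len(means[u])):
                        # a difference such as d = E - D, from the latest values
                        diff = corrected[u][s] - corrected[u][h]
                        correction += counts.get((v, g, u, h), 0) * diff
                new_value = means[v][g] + correction / sizes[v][g]
                largest_change = max(largest_change, abs(new_value - corrected[v][g]))
                corrected[v][g] = new_value
        if largest_change < tol:
            return corrected, passes
    raise RuntimeError("no convergence within max_iter passes")
```

The function returns the corrected averages together with the number of passes taken. No general convergence criterion is claimed, so the sketch raises an error rather than returning unconverged values.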



B. THE CASE WHEN VARIABILITY WITHIN THE GROUP IS TAKEN INTO ACCOUNT.

As stated on page 102, the variability, for example in a (w, x) 
group, due to the range in value of x in that group, and the correlation 
between w and x in that group, is neglected in the foregoing solution. 
Justification for this neglect depends upon the nature and magnitude 
of the variability, which, in turn, depends upon the range in value of 
the independent variable; magnitude of the change in the dependent 
variable due to a given change in the independent one; degree of 
correlation between the independent variables; and number of groups. 
When a large amount of data is at hand it is usually possible to classify 
it into a correspondingly large number of groups with respect to each 

1 No general criterion for convergence has been worked out, but it evidently 
depends upon the closeness of correlation between the independent variables. 
Of the ten problems to which the method has thus far been applied, ranging 
from the relation between dew point, humidity, and minimum air temperatures 
to the relation between body length, tail length, and foot length in mice, the 
greatest number of approximations required was fifteen. 






independent variable, in which event little gain in accuracy is made 
by taking account of variability within the group. But, when the 
number of groups is small, this is not generally true. In such cases 
variability within the group may be significantly decreased by intro- 
ducing corrections based upon an assumed linear regression of the 
dependent on the independent variable, e. g., a linear regression of 
w on x in a (w, x) group. By this means each value of the dependent 
variable may be approximately reduced to what it would have been 
had the independent variable remained at its constant average value, 
e. g., each value of $w$ in the $(w, x_1)$ group may be reduced to a value
corresponding to $x = x_1$. If the central idea of group averages has
been made clear it will be obvious that the error introduced by an
assumption of linear regression within the group is negligible. Accordingly,
after applying this correction, the outstanding variability is
legitimately attributed to "chance" and, after selecting the independent
variables, can be further reduced only by increasing the number
of groups and the number of observations in each.

For clearness, the analytical demonstration is given for the special 
case of two independent variables and three groups of each, but the 
same reasoning applies to the general case of any number of variables 
and groups, as well as to the case in which regressions are run in some 
of the groups and not in others. The notation is presented in table 3. 

TABLE 3.
General Notation Employed.

Variable $x$                Variable $y$                Meaning
$A^i$  $B^i$  $C^i$          $D^i$  $E^i$  $F^i$          Original group averages of dependent variable, $w$.
$A$  $B$  $C$                $D$  $E$  $F$                Required group averages of dependent variable, $w$.
$x_1$  $x_2$  $x_3$          $y_4$  $y_5$  $y_6$          Group averages of the values of the independent variable arranged in ascending order.
$N_1$  $N_2$  $N_3$          $N_4$  $N_5$  $N_6$          Number of entries per group.
$R_1^i$  $R_2^i$  $R_3^i$    $R_4^i$  $R_5^i$  $R_6^i$    Original regression coefficients.
$R_1$  $R_2$  $R_3$          $R_4$  $R_5$  $R_6$          Required regression coefficients.




The following equations, except for the terms involving R, are 
derived as before: 

$$ A = A^i + \frac{1}{N_1}\{n_{14} d + n_{16} f\} - \frac{1}{N_1}\{R_4 \textstyle\sum (y - y_4)_1 + R_5 \sum (y - y_5)_1 + R_6 \sum (y - y_6)_1\} \tag{16} $$

$$ B = B^i + \frac{1}{N_2}\{n_{24} d + n_{26} f\} - \frac{1}{N_2}\{R_4 \textstyle\sum (y - y_4)_2 + R_5 \sum (y - y_5)_2 + R_6 \sum (y - y_6)_2\} \tag{17} $$

$$ C = C^i + \frac{1}{N_3}\{n_{34} d + n_{36} f\} - \frac{1}{N_3}\{R_4 \textstyle\sum (y - y_4)_3 + R_5 \sum (y - y_5)_3 + R_6 \sum (y - y_6)_3\} \tag{18} $$

$$ D = D^i + \frac{1}{N_4}\{n_{41} a + n_{43} c\} - \frac{1}{N_4}\{R_1 \textstyle\sum (x - x_1)_4 + R_2 \sum (x - x_2)_4 + R_3 \sum (x - x_3)_4\} \tag{19} $$

$$ E = E^i + \frac{1}{N_5}\{n_{51} a + n_{53} c\} - \frac{1}{N_5}\{R_1 \textstyle\sum (x - x_1)_5 + R_2 \sum (x - x_2)_5 + R_3 \sum (x - x_3)_5\} \tag{20} $$

$$ F = F^i + \frac{1}{N_6}\{n_{61} a + n_{63} c\} - \frac{1}{N_6}\{R_1 \textstyle\sum (x - x_1)_6 + R_2 \sum (x - x_2)_6 + R_3 \sum (x - x_3)_6\} \tag{21} $$

where,

$n_{14}$ = number of observations of $w$ common to groups 1 and 4
$n_{16}$ = number of observations of $w$ common to groups 1 and 6

and so on, and where

$B - A = a$, $B - C = c$, $E - D = d$, and $E - F = f$.

In order to correct for the position of $w$ with respect to $y$ in groups
1, 2, and 3, and for $w$ with respect to $x$ in groups 4, 5, and 6, the terms
involving the regressions $R$ are added. They are readily derived.
Consider, for example, the values of $w$ common to groups 1 and 4
(equation 16): $\frac{n_{14} d}{N_1}$ is the correction to $A^i$ on the assumption that
each value of $w$ corresponds to the average value of $y$, i. e., to $y_4$.
But, when the variability in $y$ is taken account of, the difference
between any particular value of $y$ and its average value, $(y - y_4)$,
must be multiplied by the coefficient expressing the regression of $w$
on $y$ in group 4, i. e., by $R_4$, and subtracted. Hence $-R_4 \sum (y - y_4)_1$
is the total correction due to the position in group 4 of the $n_{14}$ values
of $w$ common to groups 1 and 4, where the subscript outside of the
bracket signifies that only the values of $y$ in group 4 corresponding to
values of $w$ in group 1 are to be summed. Similarly $-R_5 \sum (y - y_5)_1$
and $-R_6 \sum (y - y_6)_1$ are the total corrections due to the $n_{15}$ values of $w$
common to groups 1 and 5, and the $n_{16}$ values of $w$ common to groups
1 and 6. Finally, the sum of all these corrections divided by their
number, $N_1 = n_{14} + n_{15} + n_{16}$, gives the correction to the average
$A^i$, which is the second term in brackets of equation (16). In the
same way the corresponding expressions of equations (17) to (21) are
obtained.

Introduction of the six unknown regression coefficients, however, 
requires six additional equations, which are readily obtained. Any 
regression coefficient, for example $R_1$, is by definition

$$ R_1 = \frac{\sum (A' - A)(x - x_1)}{\sum (x - x_1)^2} \tag{22} $$

where $A'$ denotes that each observed value of the dependent variable
in group 1 is corrected to a constant value of the remaining independent
variables. Since, in this demonstration, $y$ is the only remaining
independent variable, this correction for any particular value of $w$
common to groups 1 and 4 is evidently given by $d - R_4 (y - y_4)_1$
for the corresponding value of $y$, whence the part of the numerator of
equation (22) due to all values of $w$ common to groups 1 and 4 is

$$ \sum \{A^i + [d - R_4 (y - y_4)_1] - A\}_4 \{x - x_1\}_4 . $$

Likewise the parts of the numerator due to all values of $w$ common to
groups 1 and 5, and also to groups 1 and 6, are respectively

$$ \sum \{A^i + [0 - R_5 (y - y_5)_1] - A\}_5 \{x - x_1\}_5 \quad \text{and} \quad \sum \{A^i + [f - R_6 (y - y_6)_1] - A\}_6 \{x - x_1\}_6 . $$

Rearranging and combining these three terms of the numerator,
equation (22) becomes




$$ R_1 = \frac{\sum (A^i - A)(x - x_1) + d \sum (x - x_1)_4 + f \sum (x - x_1)_6 - R_4 \sum (y - y_4)_1 (x - x_1)_4 - R_5 \sum (y - y_5)_1 (x - x_1)_5 - R_6 \sum (y - y_6)_1 (x - x_1)_6}{\sum (x - x_1)^2} \tag{23} $$

Similarly equations (24) to (28) are derived

$$ R_2 = \frac{\sum (B^i - B)(x - x_2) + d \sum (x - x_2)_4 + f \sum (x - x_2)_6 - R_4 \sum (y - y_4)_2 (x - x_2)_4 - R_5 \sum (y - y_5)_2 (x - x_2)_5 - R_6 \sum (y - y_6)_2 (x - x_2)_6}{\sum (x - x_2)^2} \tag{24} $$

$$ R_3 = \frac{\sum (C^i - C)(x - x_3) + d \sum (x - x_3)_4 + f \sum (x - x_3)_6 - R_4 \sum (y - y_4)_3 (x - x_3)_4 - R_5 \sum (y - y_5)_3 (x - x_3)_5 - R_6 \sum (y - y_6)_3 (x - x_3)_6}{\sum (x - x_3)^2} \tag{25} $$

$$ R_4 = \frac{\sum (D^i - D)(y - y_4) + a \sum (y - y_4)_1 + c \sum (y - y_4)_3 - R_1 \sum (x - x_1)_4 (y - y_4)_1 - R_2 \sum (x - x_2)_4 (y - y_4)_2 - R_3 \sum (x - x_3)_4 (y - y_4)_3}{\sum (y - y_4)^2} \tag{26} $$

$$ R_5 = \frac{\sum (E^i - E)(y - y_5) + a \sum (y - y_5)_1 + c \sum (y - y_5)_3 - R_1 \sum (x - x_1)_5 (y - y_5)_1 - R_2 \sum (x - x_2)_5 (y - y_5)_2 - R_3 \sum (x - x_3)_5 (y - y_5)_3}{\sum (y - y_5)^2} \tag{27} $$

$$ R_6 = \frac{\sum (F^i - F)(y - y_6) + a \sum (y - y_6)_1 + c \sum (y - y_6)_3 - R_1 \sum (x - x_1)_6 (y - y_6)_1 - R_2 \sum (x - x_2)_6 (y - y_6)_2 - R_3 \sum (x - x_3)_6 (y - y_6)_3}{\sum (y - y_6)^2} \tag{28} $$

For brevity, let

$M_{14} = \sum (x - x_1)_4$, $M_{16} = \sum (x - x_1)_6$, etc.

$P_{41} = \sum (y - y_4)_1$, $P_{43} = \sum (y - y_4)_3$, etc.

$K_{14} = \sum (x - x_1)_4 (y - y_4)_1$, $K_{15} = \sum (x - x_1)_5 (y - y_5)_1$, etc.

$L_1 = \sum (x - x_1)^2$, $L_2 = \sum (x - x_2)^2$, etc.
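The abbreviations just defined are simple running sums over the observations, which can be sketched as follows. This is an illustration under stated assumptions: the function name `bookkeeping_sums` and the record layout are hypothetical, the $x$-groups and $y$-groups are assumed to carry distinct labels (1 to 3 and 4 to 6), and $L_4$, $L_5$, $L_6$ are assumed to be the corresponding sums of squared deviations of $y$.

```python
def bookkeeping_sums(records, x_means, y_means):
    """Accumulate M, P, K and L from raw observations (hypothetical helper).

    records: iterable of (x, y, x_group, y_group), one tuple per observation
    x_means, y_means: group averages keyed by group label
    """
    M, P, K, L = {}, {}, {}, {}
    for x, y, gx, gy in records:
        dx = x - x_means[gx]   # deviation of x from its group average
        dy = y - y_means[gy]   # deviation of y from its group average
        M[gx, gy] = M.get((gx, gy), 0.0) + dx        # e.g. M_14
        P[gy, gx] = P.get((gy, gx), 0.0) + dy        # e.g. P_41
        K[gx, gy] = K.get((gx, gy), 0.0) + dx * dy   # e.g. K_14
        L[gx] = L.get(gx, 0.0) + dx * dx             # e.g. L_1
        L[gy] = L.get(gy, 0.0) + dy * dy             # e.g. L_4 (assumed analogue)
    return M, P, K, L
```

Because every deviation is taken from its own group average, the $M$ values of any one $x$-group sum to zero over the $y$-groups, which is a convenient check on the tabulation.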

Finally, in group 1, for example, the observed average of $w$, $A^i$, may be
substituted for the required average $A$ in the expression $\sum (A^i - A)(x - x_1)$,
because the sum of the deviations $(x - x_1)$ is zero. In other
words, $\sum (A^i - A)(x - x_1) = \sum (A^i - A^i)(x - x_1)$, where on the right the
first $A^i$ is an observed value and the second is the observed average, so that
this sum divided by $L_1$ is the original regression coefficient $R_1^i$. Introducing
these equivalents into equations (16) to (21) and (23) to (28), equations
(29) to (40) are obtained

$$ A = A^i + \frac{1}{N_1}\{n_{14} d + n_{16} f\} - \frac{1}{N_1}\{R_4 P_{41} + R_5 P_{51} + R_6 P_{61}\} \tag{29} $$

$$ B = B^i + \frac{1}{N_2}\{n_{24} d + n_{26} f\} - \frac{1}{N_2}\{R_4 P_{42} + R_5 P_{52} + R_6 P_{62}\} \tag{30} $$

$$ C = C^i + \frac{1}{N_3}\{n_{34} d + n_{36} f\} - \frac{1}{N_3}\{R_4 P_{43} + R_5 P_{53} + R_6 P_{63}\} \tag{31} $$

$$ D = D^i + \frac{1}{N_4}\{n_{41} a + n_{43} c\} - \frac{1}{N_4}\{R_1 M_{14} + R_2 M_{24} + R_3 M_{34}\} \tag{32} $$

$$ E = E^i + \frac{1}{N_5}\{n_{51} a + n_{53} c\} - \frac{1}{N_5}\{R_1 M_{15} + R_2 M_{25} + R_3 M_{35}\} \tag{33} $$

$$ F = F^i + \frac{1}{N_6}\{n_{61} a + n_{63} c\} - \frac{1}{N_6}\{R_1 M_{16} + R_2 M_{26} + R_3 M_{36}\} \tag{34} $$

$$ R_1 = R_1^i + \frac{M_{14} d + M_{16} f - R_4 K_{14} - R_5 K_{15} - R_6 K_{16}}{L_1} \tag{35} $$

$$ R_2 = R_2^i + \frac{M_{24} d + M_{26} f - R_4 K_{24} - R_5 K_{25} - R_6 K_{26}}{L_2} \tag{36} $$

$$ R_3 = R_3^i + \frac{M_{34} d + M_{36} f - R_4 K_{34} - R_5 K_{35} - R_6 K_{36}}{L_3} \tag{37} $$

$$ R_4 = R_4^i + \frac{P_{41} a + P_{43} c - R_1 K_{14} - R_2 K_{24} - R_3 K_{34}}{L_4} \tag{38} $$

$$ R_5 = R_5^i + \frac{P_{51} a + P_{53} c - R_1 K_{15} - R_2 K_{25} - R_3 K_{35}}{L_5} \tag{39} $$

$$ R_6 = R_6^i + \frac{P_{61} a + P_{63} c - R_1 K_{16} - R_2 K_{26} - R_3 K_{36}}{L_6} \tag{40} $$



where $R_1^i$, $R_2^i$, $R_3^i$, etc., are coefficients of regression of the dependent
on the independent variable for groups 1, 2, 3, etc., respectively,
computed from the original uncorrected values. The part added to this
coefficient in each case is the correction which must be added to the



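TABLE 4.
Process of Solution by Successive Approximation.

Line   Substitute                                                                                                  In equations     And obtain

Second approximation determined directly
1      $a^i, c^i, R_1^i, R_2^i, R_3^i$                                                                              38, 39, and 40   $\Delta R_4^{ii}, \Delta R_5^{ii}, \Delta R_6^{ii}$;  $R_4^{ii}, R_5^{ii}, R_6^{ii}$
2      $d^i, f^i, R_4^{ii}, R_5^{ii}, R_6^{ii}$                                                                     29, 30, and 31   $\Delta A^{ii}, \Delta B^{ii}, \Delta C^{ii}$;  $A^{ii}, B^{ii}, C^{ii}$;  $a^{ii}, c^{ii}, \Delta^1 a^{ii}, \Delta^1 c^{ii}$
3      $d^i, f^i, R_4^{ii}, R_5^{ii}, R_6^{ii}$                                                                     35, 36, and 37   $\Delta R_1^{ii}, \Delta R_2^{ii}, \Delta R_3^{ii}$;  $R_1^{ii}, R_2^{ii}, R_3^{ii}$
4      $a^{ii}, c^{ii}, R_1^{ii}, R_2^{ii}, R_3^{ii}$                                                               32, 33, and 34   $\Delta D^{ii}, \Delta E^{ii}, \Delta F^{ii}$;  $D^{ii}, E^{ii}, F^{ii}$;  $d^{ii}, f^{ii}, \Delta^1 d^{ii}, \Delta^1 f^{ii}$

Third approximation determined directly and checked by first differences $\Delta^1$
5      $a^{ii}, c^{ii}, R_1^{ii}, R_2^{ii}, R_3^{ii}$                                                               38, 39, and 40   $\Delta R_4^{iii}, \Delta R_5^{iii}, \Delta R_6^{iii}$;  $R_4^{iii}, R_5^{iii}, R_6^{iii}$
6      $\Delta^1 a^{ii}, \Delta^1 c^{ii}, \Delta^1 R_1^{ii}, \Delta^1 R_2^{ii}, \Delta^1 R_3^{ii}$                  38, 39, and 40   $\Delta^1 R_4^{iii}, \Delta^1 R_5^{iii}, \Delta^1 R_6^{iii}$;  $\Delta R_4^{iii}, \Delta R_5^{iii}, \Delta R_6^{iii}$
7      $d^{ii}, f^{ii}, R_4^{iii}, R_5^{iii}, R_6^{iii}$                                                            29, 30, and 31   $\Delta A^{iii}, \Delta B^{iii}, \Delta C^{iii}$;  $A^{iii}, B^{iii}, C^{iii}$;  $a^{iii}, c^{iii}, \Delta^1 a^{iii}, \Delta^1 c^{iii}$
8      $\Delta^1 d^{ii}, \Delta^1 f^{ii}, \Delta^1 R_4^{iii}, \Delta^1 R_5^{iii}, \Delta^1 R_6^{iii}$               29, 30, and 31   $\Delta^1 A^{iii}, \Delta^1 B^{iii}, \Delta^1 C^{iii}$;  $\Delta A^{iii}, \Delta B^{iii}, \Delta C^{iii}$;  $\Delta^1 a^{iii}, \Delta^1 c^{iii}, \Delta^2 a^{iii}, \Delta^2 c^{iii}$
9      $d^{ii}, f^{ii}, R_4^{iii}, R_5^{iii}, R_6^{iii}$                                                            35, 36, and 37   $\Delta R_1^{iii}, \Delta R_2^{iii}, \Delta R_3^{iii}$;  $R_1^{iii}, R_2^{iii}, R_3^{iii}$
10     $\Delta^1 d^{ii}, \Delta^1 f^{ii}, \Delta^1 R_4^{iii}, \Delta^1 R_5^{iii}, \Delta^1 R_6^{iii}$               35, 36, and 37   $\Delta^1 R_1^{iii}, \Delta^1 R_2^{iii}, \Delta^1 R_3^{iii}$;  $\Delta R_1^{iii}, \Delta R_2^{iii}, \Delta R_3^{iii}$
11     $a^{iii}, c^{iii}, R_1^{iii}, R_2^{iii}, R_3^{iii}$                                                          32, 33, and 34   $\Delta D^{iii}, \Delta E^{iii}, \Delta F^{iii}$;  $D^{iii}, E^{iii}, F^{iii}$;  $d^{iii}, f^{iii}, \Delta^1 d^{iii}, \Delta^1 f^{iii}$
12     $\Delta^1 a^{iii}, \Delta^1 c^{iii}, \Delta^1 R_1^{iii}, \Delta^1 R_2^{iii}, \Delta^1 R_3^{iii}$             32, 33, and 34   $\Delta^1 D^{iii}, \Delta^1 E^{iii}, \Delta^1 F^{iii}$;  $\Delta D^{iii}, \Delta E^{iii}, \Delta F^{iii}$;  $\Delta^1 d^{iii}, \Delta^1 f^{iii}, \Delta^2 d^{iii}, \Delta^2 f^{iii}$

Fourth approximation determined by first differences and checked by second differences $\Delta^2$
13     $\Delta^1 a^{iii}, \Delta^1 c^{iii}, \Delta^1 R_1^{iii}, \Delta^1 R_2^{iii}, \Delta^1 R_3^{iii}$             38, 39, and 40   $\Delta^1 R_4^{iv}, \Delta^1 R_5^{iv}, \Delta^1 R_6^{iv}$;  $\Delta R_4^{iv}, \Delta R_5^{iv}, \Delta R_6^{iv}$
14     $\Delta^2 a^{iii}, \Delta^2 c^{iii}, \Delta^2 R_1^{iii}, \Delta^2 R_2^{iii}, \Delta^2 R_3^{iii}$             38, 39, and 40   $\Delta^2 R_4^{iv}, \Delta^2 R_5^{iv}, \Delta^2 R_6^{iv}$;  $\Delta^1 R_4^{iv}, \Delta^1 R_5^{iv}, \Delta^1 R_6^{iv}$
15     $\Delta^1 d^{iii}, \Delta^1 f^{iii}, \Delta^1 R_4^{iv}, \Delta^1 R_5^{iv}, \Delta^1 R_6^{iv}$                29, 30, and 31   $\Delta^1 A^{iv}, \Delta^1 B^{iv}, \Delta^1 C^{iv}$;  $\Delta^1 a^{iv}, \Delta^1 c^{iv}, \Delta^2 a^{iv}, \Delta^2 c^{iv}$
16     $\Delta^2 d^{iii}, \Delta^2 f^{iii}, \Delta^2 R_4^{iv}, \Delta^2 R_5^{iv}, \Delta^2 R_6^{iv}$                29, 30, and 31   $\Delta^2 A^{iv}, \Delta^2 B^{iv}, \Delta^2 C^{iv}$;  $\Delta^1 A^{iv}, \Delta^1 B^{iv}, \Delta^1 C^{iv}$;  $\Delta^2 a^{iv}, \Delta^2 c^{iv}$
17     $\Delta^1 d^{iii}, \Delta^1 f^{iii}, \Delta^1 R_4^{iv}, \Delta^1 R_5^{iv}, \Delta^1 R_6^{iv}$                35, 36, and 37   $\Delta^1 R_1^{iv}, \Delta^1 R_2^{iv}, \Delta^1 R_3^{iv}$;  $\Delta R_1^{iv}, \Delta R_2^{iv}, \Delta R_3^{iv}$
18     $\Delta^2 d^{iii}, \Delta^2 f^{iii}, \Delta^2 R_4^{iv}, \Delta^2 R_5^{iv}, \Delta^2 R_6^{iv}$                35, 36, and 37   $\Delta^2 R_1^{iv}, \Delta^2 R_2^{iv}, \Delta^2 R_3^{iv}$;  $\Delta^1 R_1^{iv}, \Delta^1 R_2^{iv}, \Delta^1 R_3^{iv}$
19     $\Delta^1 a^{iv}, \Delta^1 c^{iv}, \Delta^1 R_1^{iv}, \Delta^1 R_2^{iv}, \Delta^1 R_3^{iv}$                  32, 33, and 34   $\Delta^1 D^{iv}, \Delta^1 E^{iv}, \Delta^1 F^{iv}$;  $\Delta^1 d^{iv}, \Delta^1 f^{iv}, \Delta^2 d^{iv}, \Delta^2 f^{iv}$
20     $\Delta^2 a^{iv}, \Delta^2 c^{iv}, \Delta^2 R_1^{iv}, \Delta^2 R_2^{iv}, \Delta^2 R_3^{iv}$                  32, 33, and 34   $\Delta^2 D^{iv}, \Delta^2 E^{iv}, \Delta^2 F^{iv}$;  $\Delta^1 D^{iv}, \Delta^1 E^{iv}, \Delta^1 F^{iv}$;  $\Delta^2 d^{iv}, \Delta^2 f^{iv}$

Supplementary explanations

$R_1^{ii} = R_1^i + \Delta R_1^{ii}$
$R_1^{iii} = R_1^i + \Delta R_1^{iii}$
$R_1^{iv} = R_1^i + \Delta R_1^{iv}$,  etc.

$\Delta^1 R_1^{ii} = R_1^{ii} - R_1^i = \Delta R_1^{ii}$
$\Delta^1 R_1^{iii} = R_1^{iii} - R_1^{ii} = \Delta R_1^{iii} - \Delta R_1^{ii}$
$\Delta^1 R_1^{iv} = R_1^{iv} - R_1^{iii} = \Delta R_1^{iv} - \Delta R_1^{iii}$,  etc.

$\Delta^2 R_1^{iii} = \Delta^1 R_1^{iii} - \Delta^1 R_1^{ii}$
$\Delta^2 R_1^{iv} = \Delta^1 R_1^{iv} - \Delta^1 R_1^{iii}$
$\Delta^2 R_1^{v} = \Delta^1 R_1^{v} - \Delta^1 R_1^{iv}$,  etc.

For similar explanations of averages see Table 2.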




value computed, without taking into account the correlation between
the independent variables. These twelve equations (29) to (40), together
with the four equivalents $B - A = a$, $B - C = c$, $E - D = d$,
and $E - F = f$, determine the six required averages and the six
required regression coefficients, and may be most conveniently solved
by a process of successive approximation similar to that already
presented for the simpler case in which variability within the group
is neglected. The details of this process are given in table 4.

As in the simpler case (see p. 105), if the process of successive
approximation, indicated by lines 1, 2, 3, 4, and 5, 7, 9, and 11 results in
convergence to definite limiting values, these values will satisfy
equations (29) to (40). In the third approximation, the procedure
indicated by lines 6, 8, 10, and 12, involving first differences, affords
a numerical check on the computation of $\Delta R_4^{iii}$, $\Delta R_5^{iii}$, etc., and $\Delta A^{iii}$,
$\Delta B^{iii}$, etc., of lines 5, 7, 9 and 11, and beginning with the fourth
approximation, second differences (lines 14, 16, 18, and 20), afford a check
against the first differences. For a numerical illustration see page 122.

If the reader has followed the reasoning thus far, he will doubtless
feel that, although the regression method is formally complete, the
labor involved in its application would, in many instances, be so great
as to make its use impracticable. To meet this objection, a slope
method has been devised which takes account of the variability within
the group in nearly as accurate a manner, but one that eliminates
half of the equations, namely, those similar to (23) to (28). The
basis of this method is the fact that the slope of a chord of a simple
curve is approximately equal to that of the tangent at the point
midway between the extremities of the chord. Accordingly, the slope
$\frac{B - A}{x_2 - x_1} = S_{1-2}$ of the chord whose extremities are $(x_1, A)$ and $(x_2, B)$
is approximately that of the tangent at the point whose abscissa is
$\frac{x_1 + x_2}{2}$. Similarly, for the point midway between $(x_2, B)$ and $(x_3, C)$,
the slope of the tangent is approximately $\frac{C - B}{x_3 - x_2} = S_{2-3}$, and so on.

But the slopes at the points $(x_1, A)$, $(x_2, B)$, and $(x_3, C)$ are required.
That at $(x_2, B)$ is readily obtained by utilizing the rate of change in
slope as a means of interpolating between $S_{1-2}$ and $S_{2-3}$. Thus



$$ \frac{S_{2-3} - S_{1-2}}{\dfrac{x_3 + x_2}{2} - \dfrac{x_2 + x_1}{2}} = \frac{2 (S_{2-3} - S_{1-2})}{x_3 - x_1} $$

is the rate of change in slope, whence the slope at any point of abscissa $x$
between $(x_1, A)$ and $(x_2, B)$ is

$$ S_{1-2} + \left(x - \frac{x_1 + x_2}{2}\right) \frac{2 (S_{2-3} - S_{1-2})}{x_3 - x_1} $$

and putting $x = x_2$ the slope at $(x_2, B)$ is

$$ S_2 = S_{1-2} + \frac{x_2 - x_1}{x_3 - x_1} (S_{2-3} - S_{1-2}) \tag{41} $$

Similarly, the slope at any point of abscissa $x$ between $(x_2, B)$ and
$(x_3, C)$ is

$$ S_{2-3} + \left(x - \frac{x_2 + x_3}{2}\right) \frac{2 (S_{2-3} - S_{1-2})}{x_3 - x_1} $$

and putting $x = x_2$, the slope at $(x_2, B)$ is

$$ S_2 = S_{2-3} + \frac{x_2 - x_3}{x_3 - x_1} (S_{2-3} - S_{1-2}) \tag{42} $$

In the same way the slope corresponding to each abscissa between
the extremes $x_1$ and $x_n$ is found, where $x_n$ denotes the last group average.
In the particular case at hand $x$ is divided into but three groups
so that $x_n = x_3$, whence the slopes $S_1$ and $S_3$ must be determined by
utilizing the rate of change in slope between $S_{1-2}$ and $S_{2-3}$ as a means
of extrapolating beyond $S_{1-2}$ and $S_{2-3}$. Thus

$$ S_1 = S_{1-2} + \frac{x_1 - x_2}{x_3 - x_1} (S_{2-3} - S_{1-2}) \tag{43} $$

and

$$ S_3 = S_{2-3} + \frac{x_3 - x_2}{x_3 - x_1} (S_{2-3} - S_{1-2}) \tag{44} $$



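Similarly equations (45) to (48) are obtained.

$$ S_4 = S_{4-5} + \frac{y_4 - y_5}{y_6 - y_4} (S_{5-6} - S_{4-5}) \tag{45} $$

$$ S_5 = S_{4-5} + \frac{y_5 - y_4}{y_6 - y_4} (S_{5-6} - S_{4-5}) \tag{46} $$

$$ S_5 = S_{5-6} + \frac{y_5 - y_6}{y_6 - y_4} (S_{5-6} - S_{4-5}) \tag{47} $$

$$ S_6 = S_{5-6} + \frac{y_6 - y_5}{y_6 - y_4} (S_{5-6} - S_{4-5}) \tag{48} $$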

Equations (23) to (28) are thus replaced by equations (41) to (48)
whose solution depends solely upon the relation between the averages
$A$, $B$, $C$, etc. Thus, if $S_4$, $S_5$, and $S_6$ be substituted for $R_4$, $R_5$, and $R_6$
in equation (16) the last expression will be found to involve only $d$ and
$f$ as unknowns. For $S_{4-5}$ and $S_{5-6}$ of equations (45) to (48) are
respectively defined as

$$ \frac{E - D}{y_5 - y_4} = \frac{d}{y_5 - y_4} \quad \text{and} \quad \frac{F - E}{y_6 - y_5} = \frac{-f}{y_6 - y_5} . $$
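Equations (41), (43), and (44) can be sketched for a single variable with three groups. This is a minimal illustration: the function name `group_slopes` is hypothetical, and the sample points in the usage note are not data from the paper.

```python
def group_slopes(xs, ws):
    """Slopes S_1, S_2, S_3 at three group averages from the two chord slopes."""
    x1, x2, x3 = xs
    a, b, c = ws
    s12 = (b - a) / (x2 - x1)            # chord slope S_{1-2}
    s23 = (c - b) / (x3 - x2)            # chord slope S_{2-3}
    spread = (s23 - s12) / (x3 - x1)     # half the rate of change in slope
    s1 = s12 + (x1 - x2) * spread        # equation (43), extrapolated
    s2 = s12 + (x2 - x1) * spread        # equation (41), interpolated
    s3 = s23 + (x3 - x2) * spread        # equation (44), extrapolated
    return s1, s2, s3
```

For points taken from the parabola $w = x^2$ at $x = 1, 2, 4$, for which the slope varies linearly with $x$, `group_slopes((1.0, 2.0, 4.0), (1.0, 4.0, 16.0))` returns the exact tangent slopes 2, 4, and 8. For other curves the values are approximations, as the text states.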



4. Illustration of Method by Solution of a Particular 
Problem Concerning the Relation between Temperature, 
Precipitation, and Yield of Wheat in South Dakota. 1 

From 1891 to 1917 the mean air temperature for the month of 
June in South Dakota varied from 60.4° F to 73.4° F, while the total 
precipitation during the months of May and June varied from 3.5 to 
11.6 inches, and the yield of wheat (harvested in August) varied from 
4.0 to 17.0 bushels per acre. In attempting to ascertain what effect, 
if any, temperature and precipitation had upon the yield of wheat, 
Blair (1913; 1915) applied the method of simple linear correlation to 
the portion of the data then available, and found a strong negative 
correlation between temperature and yield, and a somewhat smaller 
positive correlation between precipitation and yield. But, he also 
found a high negative correlation between temperature and precipi- 
tation, which, in 1918, led him to bring the data up to date and to 
consider the question: "how much of the apparent relation between 
precipitation and yield is really due to the influence of precipitation, 
and how much is due to the simultaneous influence of temperature; 
and, similarly, how much of the apparent relation between tempera- 
ture and yield is due to precipitation." (Blair 1918, p. 71). He 

1 We desire to express our obligation to Mr. H. H. Collins who has made 
the computations involved in this illustrative problem and in many others. 
Without his aid publication would have been materially delayed. 




applied the method of multiple linear correlation and found that, 
after eliminating the effect of precipitation, correlation between 
temperature and yield was reduced from — 0.62 to — 0.48, and, 
likewise, that, after eliminating the influence of temperature, correla- 
tion between precipitation and yield was reduced from + 0.49 to 
+ 0.22. The functional relation he obtained between yield, tempera- 
ture and precipitation is, in his notation, 

$$ y = 11.2 - 0.48 \frac{\sigma_y}{\sigma_t} (t - 65.9^\circ) + 0.22 \frac{\sigma_y}{\sigma_p} (p - 6.8) \tag{49} $$

where $\sigma_y = 3.0$, $\sigma_t = 3.0$, and $\sigma_p = 2.1$.

In applying the method of successive approximation to these data 
it is not our purpose to discuss, except incidentally, the results obtained, 
but to give a simple, concrete illustration of the process actually fol- 
lowed, first, in the case when variability within the group is neglected, 
and second, in the case when this variability is taken into account. 
In both instances the same notation is employed as in the analytic 
demonstrations. It should be noted, however, that, in the case when 
variability within the group is neglected, three independent variables, 
x, y, and z, are used in the analytic demonstration, while, in this 
illustrative problem, only two are involved.

In the first instance, then, the initial step, as shown in table 5, is
to group the data with respect to temperature, arranging the twenty-seven
entries according to its ascending order of magnitude, and similarly,
to group the data with respect to precipitation, arranging the
twenty-seven entries according to its increase. Secondly, each series
is divided into three groups of nine entries each (see p. 127), i. e. groups
1, 2, and 3 of the data arranged with respect to temperature, and
groups 4, 5, and 6 of the data arranged with respect to precipitation.
Thirdly, opposite each entry in groups 1, 2, and 3 is entered the number
4, 5, or 6 designating which precipitation group ($y$-group) the
entry is in, and, similarly, opposite each entry in groups 4, 5, and 6
is entered the number 1, 2, or 3 designating which temperature group
($x$-group) the entry is in. Lastly, the average wheat yield for each
group ($A^i$, $B^i$, $C^i$, $D^i$, $E^i$, and $F^i$), the average temperature for each
of groups 1, 2, and 3 ($x_1$, $x_2$, and $x_3$), the average precipitation for
each of groups 4, 5, and 6 ($y_4$, $y_5$, and $y_6$), and the number of entries
common to groups 1, 2, or 3, and 4, 5, or 6 ($n_{14} = n_{41}$, $n_{15} = n_{51}$, etc.)
are determined. Each of these steps is indicated in table 5.
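The last step, counting the entries common to each pair of groups and averaging the yield in each group, can be sketched as follows. The yields and group labels are read from table 5; a few labels that are illegible in the scanned table were inferred here by cross-checking the two arrangements and the printed totals, so they should be treated as an assumption. The name `tabulate` is hypothetical.

```python
from collections import Counter

# (yield in bu. per acre, temperature group, precipitation group), one per year
RECORDS = [
    (17.0, 1, 6), (6.3, 1, 6), (12.2, 1, 4), (14.0, 1, 4), (12.8, 1, 6),
    (12.0, 1, 5), (12.5, 1, 6), (13.4, 1, 6), (11.2, 1, 5),
    (15.2, 2, 5), (13.7, 2, 6), (9.6, 2, 5), (14.2, 2, 4), (8.0, 2, 4),
    (13.8, 2, 5), (12.9, 2, 6), (10.7, 2, 5), (14.1, 2, 6),
    (11.2, 3, 5), (12.4, 3, 5), (9.0, 3, 6), (12.8, 3, 4), (6.9, 3, 4),
    (9.0, 3, 5), (8.5, 3, 4), (6.6, 3, 4), (4.0, 3, 4),
]

def tabulate(records):
    """Return cross counts n_gh, group sizes N_g and original group averages."""
    counts, totals, sizes = Counter(), Counter(), Counter()
    for w, gx, gy in records:
        counts[gx, gy] += 1          # entry common to x-group gx and y-group gy
        for g in (gx, gy):
            totals[g] += w
            sizes[g] += 1
    means = {g: totals[g] / sizes[g] for g in sizes}
    return counts, sizes, means

counts, sizes, means = tabulate(RECORDS)
print(counts[1, 4], counts[1, 5], counts[1, 6])   # n_14, n_15, n_16
print(round(means[1], 2), round(means[4], 2))     # A^i and D^i
```

Only the group labels are needed for the cross counts; the temperatures and precipitations themselves enter the calculation solely through the grouping.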






TABLE 5.

Data Concerning Yield of Wheat in South Dakota from 1891 to 1917
(Blair 1918, p. 73) Tabulated as Required by the Method of
Successive Approximation in the Case when Variability
within the Group is Neglected.

           Grouped with respect to temperature, x         Grouped with respect to precipitation, y
           Yield            x        y-group               Yield            y          x-group
           (bu. per acre)   (deg F)                        (bu. per acre)   (inches)

Group 1                                          Group 4
           17.0             60.4     6                     6.9              3.5        3
            6.3             61.5     6                     4.0              3.6        3
           12.2             62.6     4                     6.6              3.7        3
           14.0             62.7     4                     14.2             3.8        2
           12.8             63.7     6                     12.8             3.9        3
           12.0             63.7     5                     8.5              4.5        3
           12.5             63.9     6                     8.0              4.6        2
           13.4             63.9     6                     14.0             5.3        1
           11.2             64.2     5                     12.2             6.0        1
Total      111.4            566.6                          87.2             38.9
Average    12.38 = A^i      62.95 = x_1                    9.69 = D^i       4.32 = y_4
Counts     n_14 = 2, n_15 = 2, n_16 = 5, N_1 = 9           n_41 = 2, n_42 = 2, n_43 = 5, N_4 = 9

Group 2                                          Group 5
           15.2             64.2     5                     9.0              6.0        3
           13.7             64.4     6                     15.2             6.5        2
            9.6             64.5     5                     9.6              6.5        2
           14.2             64.8     4                     11.2             6.6        3
            8.0             65.0     4                     12.4             6.8        3
           13.8             65.0     5                     12.0             6.9        1
           12.9             66.3     6                     13.8             7.0        2
           10.7             66.4     5                     11.2             7.7        1
           14.1             66.9     6                     10.7             8.1        2
Total      112.2            587.5                          105.1            62.1
Average    12.47 = B^i      65.27 = x_2                    11.68 = E^i      6.90 = y_5
Counts     n_24 = 2, n_25 = 4, n_26 = 3, N_2 = 9           n_51 = 2, n_52 = 4, n_53 = 3, N_5 = 9

Group 3                                          Group 6
           11.2             67.0     5                     12.9             8.1        2
           12.4             67.3     5                     9.0              8.1        3
            9.0             67.5     6                     6.3              8.2        1
           12.8             68.3     4                     13.4             8.4        1
            6.9             69.4     4                     14.1             9.0        2
            9.0             69.6     5                     17.0             9.0        1
            8.5             70.3     4                     12.5             9.5        1
            6.6             70.6     4                     12.8             10.0       1
            4.0             73.4     4                     13.7             11.6       2
Total      80.4             623.4                          111.7            81.9
Average    8.93 = C^i       69.27 = x_3                    12.41 = F^i      9.10 = y_6
Counts     n_34 = 5, n_35 = 3, n_36 = 1, N_3 = 9           n_61 = 5, n_62 = 3, n_63 = 1, N_6 = 9




Selecting the middle group of each series (groups 2 and 5) as the 
"standard" (see p. 103) the equations, derived as on page 104 for 
determining the corrected averages, are 

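$$ A = 12.38 + \tfrac{1}{9} (2d + 5f) \tag{50} $$

$$ B = 12.47 + \tfrac{1}{9} (2d + 3f) \tag{51} $$

$$ C = 8.93 + \tfrac{1}{9} (5d + 1f) \tag{52} $$

$$ D = 9.69 + \tfrac{1}{9} (2a + 5c) \tag{53} $$

$$ E = 11.68 + \tfrac{1}{9} (2a + 3c) \tag{54} $$

$$ F = 12.41 + \tfrac{1}{9} (5a + 1c) \tag{55} $$

where

$$ a = B - A \tag{56} $$

$$ c = B - C \tag{57} $$

$$ d = E - D \tag{58} $$

$$ f = E - F \tag{59} $$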

Following the method of solution given in table 2, the first approximations
to $d$ and $f$¹ ($d^i = 11.68 - 9.69 = 1.99$, $f^i = 11.68 - 12.41 = -0.73$)
substituted in the second members of equations (50),
(51) and (52) give

$$ \Delta A^{ii} = \tfrac{1}{9} [2 \times 1.99 + 5 (-0.73)] = 0.037 \tag{60} $$

$$ \Delta B^{ii} = \tfrac{1}{9} [2 \times 1.99 + 3 (-0.73)] = 0.199 \tag{61} $$



1 In order to save labor and reduce the number of required approximations
to a minimum, correction should be made first for the variable having the
greatest apparent effect. In this particular case the apparent effects of
temperature and precipitation are essentially the same, and first approximations
to $d$ and $f$ are used rather than those to $a$ and $c$ merely because it conforms
to the procedure in table 2.




$$ \Delta C^{ii} = \tfrac{1}{9} [5 \times 1.99 + 1 (-0.73)] = 1.024 \tag{62} $$

which are second approximations to the quantities that must be added
to $A^i$, $B^i$, and $C^i$, respectively, to equal $A$, $B$, and $C$. Substituting
the original averages $A^i$, $B^i$, and $C^i$ plus the quantities $\Delta A^{ii}$, $\Delta B^{ii}$,
and $\Delta C^{ii}$ respectively into equations (56) and (57) gives second
approximations ($a^{ii}$ and $c^{ii}$) to $a$ and $c$. That is

$$ a^{ii} = (12.47 + 0.199) - (12.38 + 0.037) = a^i + (\Delta B^{ii} - \Delta A^{ii}) = a^i + \Delta^1 a^{ii} = 0.09 + 0.162 = 0.252 \tag{63} $$

$$ c^{ii} = (12.47 + 0.199) - (8.93 + 1.024) = c^i + (\Delta B^{ii} - \Delta C^{ii}) = c^i + \Delta^1 c^{ii} = 2.715 \tag{64} $$

Substituting $a^{ii}$ and $c^{ii}$ for $a$ and $c$ in equations (53), (54), and (55)
gives

$$ \Delta D^{ii} = \tfrac{1}{9} [2 \times 0.252 + 5 \times 2.715] = 1.564 \tag{65} $$

$$ \Delta E^{ii} = \tfrac{1}{9} [2 \times 0.252 + 3 \times 2.715] = 0.961 \tag{66} $$

$$ \Delta F^{ii} = \tfrac{1}{9} [5 \times 0.252 + 1 \times 2.715] = 0.442 \tag{67} $$

which are second approximations to the quantities that must be
added to $D^i$, $E^i$, and $F^i$ to give $D$, $E$, and $F$. Substituting the original
averages $D^i$, $E^i$, and $F^i$ plus the quantities $\Delta D^{ii}$, $\Delta E^{ii}$, and $\Delta F^{ii}$ into
equations (58) and (59) gives second approximations ($d^{ii}$ and $f^{ii}$) to $d$
and $f$. That is

$$ d^{ii} = d^i + (\Delta E^{ii} - \Delta D^{ii}) = d^i + \Delta^1 d^{ii} = 1.99 - 0.603 = 1.387 \tag{68} $$

$$ f^{ii} = f^i + (\Delta E^{ii} - \Delta F^{ii}) = f^i + \Delta^1 f^{ii} = -0.73 + 0.519 = -0.211 \tag{69} $$
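The arithmetic of equations (60) to (69), and its continuation to convergence, can be sketched as follows. This is an illustration rather than the authors' worksheet: the names `one_pass` and `iterate` and the stopping tolerance are assumptions, and the checks by differences of table 2 are omitted. Each pass substitutes the current $d$ and $f$ in equations (50) to (52), forms $a$ and $c$, substitutes these in equations (53) to (55), and forms the next $d$ and $f$.

```python
ORIGINAL = {"A": 12.38, "B": 12.47, "C": 8.93, "D": 9.69, "E": 11.68, "F": 12.41}

def one_pass(d, f):
    """One approximation: d, f -> A, B, C -> a, c -> D, E, F -> new d, f."""
    A = ORIGINAL["A"] + (2 * d + 5 * f) / 9   # equation (50)
    B = ORIGINAL["B"] + (2 * d + 3 * f) / 9   # equation (51)
    C = ORIGINAL["C"] + (5 * d + 1 * f) / 9   # equation (52)
    a, c = B - A, B - C                       # equations (56) and (57)
    D = ORIGINAL["D"] + (2 * a + 5 * c) / 9   # equation (53)
    E = ORIGINAL["E"] + (2 * a + 3 * c) / 9   # equation (54)
    F = ORIGINAL["F"] + (5 * a + 1 * c) / 9   # equation (55)
    values = {"A": A, "B": B, "C": C, "D": D, "E": E, "F": F}
    return values, E - D, E - F               # equations (58) and (59)

def iterate(tol=1e-6, max_passes=50):
    """Repeat one_pass from the first approximations until d and f settle."""
    d = ORIGINAL["E"] - ORIGINAL["D"]   # first approximation, 1.99
    f = ORIGINAL["E"] - ORIGINAL["F"]   # first approximation, -0.73
    history = []
    for _ in range(max_passes):
        values, d_new, f_new = one_pass(d, f)
        history.append((values, d_new, f_new))
        if abs(d_new - d) < tol and abs(f_new - f) < tol:
            break
        d, f = d_new, f_new
    return history

history = iterate()
print(history[0][1], history[0][2])   # second approximations to d and f
```

The first entry of `history` corresponds to the second approximation worked out above; because the sketch carries unrounded intermediate values, its figures can differ from the printed ones in the last decimal place.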

If this process be continued, the successive approximations will 
converge to the required corrections AA, AB, AC, AD, AE, and AF, 
and to the required differences a, c, d, and /. But, unless two com- 
puters are available to duplicate each other's work, a second process 
of computation is needed to check the results obtained by the first 
process. Beginning with the third approximation, this is accomplished
by computing directly the differences $\Delta^{1}A^{iii}$, $\Delta^{1}B^{iii}$, $\Delta^{1}C^{iii}$,
etc., that must be added to $\Delta A^{ii}$, $\Delta B^{ii}$, $\Delta C^{ii}$, etc., to obtain $\Delta A^{iii}$,
$\Delta B^{iii}$, $\Delta C^{iii}$, etc., and so on. For example, the third approximations
found by substituting $d^{ii}$ and $f^{ii}$, given by equations (68) and (69),
for d and f in equations (50), (51) and (52) are

\[ \Delta A^{iii} = 0.191 \tag{70} \]

\[ \Delta B^{iii} = 0.238 \tag{71} \]

\[ \Delta C^{iii} = 0.747 \tag{72} \]

But by substituting the differences $\Delta^{1}d^{ii} = d^{ii} - d^{i} = -0.603$ and
$\Delta^{1}f^{ii} = f^{ii} - f^{i} = 0.519$ for d and f in equations (50), (51) and (52) we
obtain

\[ \Delta^{1}A^{iii} = 0.154 \tag{73} \]

\[ \Delta^{1}B^{iii} = 0.039 \tag{74} \]

\[ \Delta^{1}C^{iii} = -0.277 \tag{75} \]

which when added to $\Delta A^{ii} = 0.037$, $\Delta B^{ii} = 0.199$, and $\Delta C^{ii} = 1.024$
(equations 60 to 62) give $\Delta A^{iii} = 0.191$, $\Delta B^{iii} = 0.238$, and $\Delta C^{iii} = 0.747$,
thus checking the results obtained in equations (70), (71), and
(72).
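
The check works because equations (50) to (52) are linear in d and f, so applying them to a difference gives the difference of the results. The following is a minimal Python sketch of that check, using only the figures quoted above; the function name `delta_abc` is a label chosen here for illustration and is not the authors' notation.

```python
def delta_abc(d, f):
    """Equations (50)-(52): corrections to A, B and C from the differences d and f."""
    return ((2 * d + 5 * f) / 9, (2 * d + 3 * f) / 9, (5 * d + 1 * f) / 9)

d1, f1 = 1.99, -0.73      # first approximations to d and f
d2, f2 = 1.387, -0.211    # second approximations, equations (68) and (69)

second = delta_abc(d1, f1)                 # second approximations, equations (60)-(62)
direct = delta_abc(d2, f2)                 # third approximations, equations (70)-(72)
first_diff = delta_abc(d2 - d1, f2 - f1)   # first differences, equations (73)-(75)

# Adding the first differences to the second approximations must reproduce
# the third approximations obtained directly.
checked = tuple(s + dd for s, dd in zip(second, first_diff))

for name, x, y in zip("ABC", direct, checked):
    print(f"Delta {name}: direct {x:.3f}, by first differences {y:.3f}")
```

Both routes give the same third approximations, which is the agreement the text relies on when a single computer checks his own work.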

In the same way the values of $\Delta D^{iii}$, $\Delta E^{iii}$, and $\Delta F^{iii}$ are obtained
directly and checked by first differences $\Delta^{1}$, and, beginning with the
fourth approximation, computation of first differences is checked by
that of second differences $\Delta^{2}$, as indicated in table 2. The values
obtained in the second and third approximations computed by the
processes just described, the values obtained in the fourth and higher
approximations computed by first differences and checked by second
differences, the final limiting values obtained in the seventh approximation,
and everything required for making each computation are
given in table 6, which is for the most part self-explanatory. Suffice
it merely to call attention to the fact that the coefficients 2 and 5
of equation (50) multiplied respectively by the first approximation to
d and f (1.99 and -0.73), by the second approximation to $\Delta^{1}d$ and
$\Delta^{1}f$ (-0.603 and 0.519), and by the third approximation to $\Delta^{2}d$ and
$\Delta^{2}f$ (0.533 and -0.410), and divided by 9, give the second approximation,
0.037, to $\Delta A$, the third approximation, 0.154, to $\Delta^{1}A$, and the
fourth approximation, -0.109, to $\Delta^{2}A$, and so on. But the coefficients
2 and 5 of equation (53) multiplied respectively by the second
approximation to a and c (0.252 and 2.715), the third approximation
to $\Delta^{1}a$ and $\Delta^{1}c$ (-0.115 and 0.316), and the fourth approximation to
$\Delta^{2}a$ and $\Delta^{2}c$ (0.091 and -0.268) and divided by 9, give the second
approximation, 1.564, to $\Delta D$, the third approximation, 0.150, to $\Delta^{1}D$,
and the fourth approximation, -0.129, to $\Delta^{2}D$, and so on. Finally,
each limiting value, for example, $\Delta A = 0.245$, is obtained by adding
to $\Delta A^{ii} = 0.037$ the sum of all approximations to the first differences
$\Delta^{1}A$, i.e., 0.154 + 0.045 + 0.008 + 0.001.



TABLE 6.

Numerical Solution by Method of Successive Approximation of South Dakota Wheat
Problem in the Case when Variability within the Group is Neglected.

Second and third approximations and final values (7th approximation)

   d       f       ΔA      ΔB      ΔC      a       c       ΔD      ΔE      ΔF     Approx.
   1.99   -0.73                                                                    I
   1.387  -0.211   0.037   0.199   1.024   0.252   2.715   1.564   0.961   0.442   II
                   0.191   0.238   0.747   0.137   3.031   1.714   1.041   0.413   III
   1.305  -0.081   0.245   0.263   0.716   0.108   3.087   1.738   1.053   0.404   VII

Third and succeeding approximations to first differences

   Δ¹d     Δ¹f     Δ¹A     Δ¹B     Δ¹C     Δ¹a     Δ¹c     Δ¹D     Δ¹E     Δ¹F    Approx.
  -0.603   0.519                                                                   II
  -0.070   0.109   0.154   0.039  -0.277  -0.115   0.316   0.150   0.080  -0.029   III
  -0.010   0.019   0.045   0.021  -0.027  -0.024   0.048   0.021   0.011  -0.008   IV
  -0.002   0.002   0.008   0.004  -0.003  -0.004   0.007   0.003   0.001  -0.001   V
  -0.000   0.000   0.001   0.000  -0.001  -0.001   0.001   0.000   0.000  -0.000   VI
  -0.000   0.000   0.000   0.000  -0.000  -0.000   0.000   0.000   0.000  -0.000   VII

Fourth and succeeding approximations to second differences

   Δ²d     Δ²f     Δ²A     Δ²B     Δ²C     Δ²a     Δ²c     Δ²D     Δ²E     Δ²F    Approx.
   0.533  -0.410                                                                   III
   0.060  -0.090  -0.109  -0.018   0.250   0.091  -0.268  -0.129  -0.069   0.021   IV
   0.008  -0.017  -0.037  -0.017   0.023   0.020  -0.041  -0.018  -0.009   0.007   V
   0.002  -0.002  -0.008  -0.004   0.003   0.003  -0.006  -0.003  -0.001   0.001   VI
   0.000  -0.000  -0.001  -0.000   0.001   0.001  -0.001  -0.000  -0.000   0.000   VII

Coefficients used in multiplying by d, f, a, and c, etc.

   2 and 5 of d and f in (50)        2 and 5 of a and c in (53)
   2 and 3 of d and f in (51)        2 and 3 of a and c in (54)
   5 and 1 of d and f in (52)        5 and 1 of a and c in (55)




Substituting the final values,
$\Delta A = \tfrac{1}{9}(2d + 5f) = 0.245$, $\Delta B = \tfrac{1}{9}(2d + 3f) = 0.263$, etc., carried
to two decimals, in equations (50) to (55) gives the corrected averages

A = 12.38 + 0.245 = 12.62 (76) 

B =12.47 + 0.263 = 12.73 (77) 

C = 8.93 + 0.716 = 9.65 (78) 

D = 9.69 + 1.738 = 11.43 (79) 

E = 11.68 + 1.053 = 12.73 (80) 

F = 12.41 + 0.404 = 12.81 (81) 

and these substituted in equations (56) to (59) give 0.11 for a, 3.08
for c, 1.30 for d, and -0.08 for f, which agree to the nearest hundredth
with the final values of these differences entered in table 6. An additional
check upon the computations is B = E, which should be the
case since each of these two averages corresponds to the standard
values of temperature and precipitation.
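
The whole of table 6 can be retraced with a short program. The following is a minimal Python sketch, not the authors' hand procedure by first and second differences: it simply repeats the substitutions of equations (50) to (59) until d and f stop changing. The names `corrections` and `solve`, and the stopping tolerance, are assumptions introduced here for illustration.

```python
# Original group averages (first approximations) quoted in the text.
A0, B0, C0 = 12.38, 12.47, 8.93    # temperature groups 1-3
D0, E0, F0 = 9.69, 11.68, 12.41    # precipitation groups 4-6

# Coefficient pairs listed at the foot of table 6.
COEFFS = [(2, 5), (2, 3), (5, 1)]

def corrections(p, q):
    """Equations (50)-(52) with (d, f), or (53)-(55) with (a, c)."""
    return [(m * p + n * q) / 9 for m, n in COEFFS]

def solve(tol=1e-9, max_iter=100):
    d, f = E0 - D0, E0 - F0                      # first approximations 1.99 and -0.73
    for _ in range(max_iter):
        dA, dB, dC = corrections(d, f)           # equations (50)-(52)
        a = (B0 + dB) - (A0 + dA)                # equation (56)
        c = (B0 + dB) - (C0 + dC)                # equation (57)
        dD, dE, dF = corrections(a, c)           # equations (53)-(55)
        d_new = (E0 + dE) - (D0 + dD)            # equation (58)
        f_new = (E0 + dE) - (F0 + dF)            # equation (59)
        done = abs(d_new - d) < tol and abs(f_new - f) < tol
        d, f = d_new, f_new
        if done:
            break
    # Recompute every quantity from the converged d and f so the output is consistent.
    dA, dB, dC = corrections(d, f)
    a, c = (B0 + dB) - (A0 + dA), (B0 + dB) - (C0 + dC)
    dD, dE, dF = corrections(a, c)
    return {"A": A0 + dA, "B": B0 + dB, "C": C0 + dC,
            "D": D0 + dD, "E": E0 + dE, "F": F0 + dF,
            "a": a, "c": c, "d": d, "f": f}

result = solve()
for key, value in result.items():
    print(f"{key} = {value:.3f}")
```

Carried to full machine precision the limiting values differ from the printed ones only in the third decimal, as is to be expected when each hand-computed step is rounded to three places, and the check B = E is satisfied.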

As stated on page 99, the functional relations found by this method
are defined by the series of corresponding averages of dependent and
independent variables. Many ways of utilizing these relations will
occur to the reader. The most precise and, perhaps, the most desirable
way would be to plot the averages of w corresponding to $x_1$,
$x_2$, and $x_3$ and those corresponding to $y_4$, $y_5$, and $y_6$, and so determine
the type of equation relating w to x and w to y; and then, by the
method of least squares or method of moments, compute the constants
from the original data. A more expedient way is to use the relations
between the averages directly, and correct for the neglected variability
within the group. For this purpose the functional relations
may be conveniently expressed as

\[ w = f_1(x) + f_2(y) = 12.73 + F_1(\bar{x}) + F_2(\bar{y}) \tag{82} \]

where $\bar{x}$ and $\bar{y}$ signify any one of the three group averages of x and of y,
and where $F_1(\bar{x})$ and $F_2(\bar{y})$ are defined by the series in table 7.






TABLE 7.
Definition of Functional Relations.

                x                                           y
   mean          limits        F₁(x̄)           mean         limits        F₂(ȳ)
   62.95 = x₁    60.4 - 64.2   -0.11 = -a      4.32 = y₄    3.5 -  6.0    -1.30 = -d
   65.27 = x₂    64.2 - 66.9    0              6.90 = y₅    6.0 -  8.1     0
   69.27 = x₃    67.0 - 73.4   -3.08 = -c      9.10 = y₆    8.1 - 11.6     0.08 = -f



In computing a value of w for given values of x and y, say x = 73.4
and y = 3.6, correction is first made for the group which, from equation
(82) and table 7, is

\[ 12.73 - 3.08 - 1.30 = 8.35 \tag{83} \]

However, this value for w corresponds not to x = 73.4 and y = 3.6,
but, since variability within the group was neglected, to x = 69.27
and y = 4.32. Correction for this neglected variability may be made
by running a linear regression in each group and introducing the
regression coefficients into equation (83); by plotting the averages
and ascertaining the type of equation as stated above, and then computing
the constants from the averages, and replacing equation (83)
by the equation so obtained; or by a simple process of interpolation
like the following. Since -3.08 is the change in w due to a change in
x from $x_2 = 65.27$ to $x_3 = 69.27$, the change in w due to a change in x
from $x_3 = 69.27$ to x = 73.4 should be approximately proportional.
That is, the change in w due to the variability of x in group 3 is
approximately given by

\[ \frac{73.4 - 69.27}{69.27 - 65.27} \times (-3.08) = -3.09, \]

and, in the same way, the change in w due to the variability of y in group 4 is
approximately given by

\[ \frac{3.6 - 4.32}{4.32 - 6.90} \times (-1.30) = -0.36. \]

Introducing these approximate corrections into equation (83) gives

\[ w = 8.35 - 3.09 - 0.36 = 4.7 \tag{84} \]




which agrees well with the observed value of 4.0 given in table 5. For 
comparison, it is well to add that the linear regression equation (49), 
derived by the method of multiple correlation, is in our notation 

w = 11.2 - 0.48 (x - 65.9) + 0.31 (y - 6.8) (85) 

Putting x = 73.4 and y = 3.6, as above, gives w = 6.6, a value 
departing more widely from that observed. 
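
The comparison just made can be sketched in Python as follows. The group means, limits and F values are those of table 7 and the linear equation is equation (85); the function names, and the choice to return zero correction for an entry falling in the standard group, are assumptions made here for illustration. Evaluated with the rounded entries of table 7, the proportional correction for x comes out near -3.18, so the sketch is not expected to reproduce the printed -3.09 and 4.7 to the last figure.

```python
STANDARD = 12.73  # corrected average B = E for the standard groups

# (group mean, lower limit, upper limit, F value), taken from table 7
X_GROUPS = [(62.95, 60.4, 64.2, -0.11), (65.27, 64.2, 66.9, 0.0), (69.27, 67.0, 73.4, -3.08)]
Y_GROUPS = [(4.32, 3.5, 6.0, -1.30), (6.90, 6.0, 8.1, 0.0), (9.10, 8.1, 11.6, 0.08)]

def find_group(value, groups):
    """Index of the first group whose limits contain value."""
    for k, (_, low, high, _) in enumerate(groups):
        if low <= value <= high:
            return k
    raise ValueError(f"{value} lies outside the tabulated limits")

def interpolation(value, groups, k, standard=1):
    """Proportional correction for position within group k, measured against the standard group."""
    if k == standard:
        return 0.0  # assumption: the text gives no interpolation rule for the standard group
    mean_k, _, _, f_k = groups[k]
    mean_s, _, _, f_s = groups[standard]
    return (value - mean_k) / (mean_k - mean_s) * (f_k - f_s)

def group_estimate(x, y):
    """Group value of equation (83) and the two interpolation corrections."""
    i, j = find_group(x, X_GROUPS), find_group(y, Y_GROUPS)
    base = STANDARD + X_GROUPS[i][3] + Y_GROUPS[j][3]
    return base, interpolation(x, X_GROUPS, i), interpolation(y, Y_GROUPS, j)

def linear_regression_estimate(x, y):
    """Equation (85), the multiple-correlation regression quoted in the text."""
    return 11.2 - 0.48 * (x - 65.9) + 0.31 * (y - 6.8)

base, dx, dy = group_estimate(73.4, 3.6)
print(f"group value {base:.2f}, x correction {dx:.2f}, y correction {dy:.2f}")
print(f"group-average estimate {base + dx + dy:.2f}")
print(f"linear regression estimate {linear_regression_estimate(73.4, 3.6):.1f}")
```

On either reading of the x correction, the group-average estimate lies nearer the observed 4.0 than does the value 6.6 given by the linear equation.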

In illustrating the procedure when variability within the group is 
taken into account, the wheat data are arranged and grouped as shown 
in table 5. In addition, for the purpose of computing the required 
regression coefficients, the deviations A i — A', and x — Xi in group 1, 
B l — B* and x — x 2 in group 2, and so on to F* — F' and y — y 6 in 
group 6 are entered. However, in order to save labor in computing 
2(^4* — A*) (x — Xi), etc., the averages A*, Xi, B { , x 2 , etc., are replaced 
by the approximate values Aj, %[, Bj, x^, etc., carried only to as many 
places as the individual entries, and deviations from these values are 
the ones that are entered. To save space table 5, including these 
deviations, is not reproduced. 

Owing to substitution of the above approximate values for the 
true averages in equations (29) to (40), certain initial corrections must 
be applied before they are ready for solution. Let Xi = Xj -4- AXj 
and A' = Aj 4- AAj where Ax x and AAj are the respective correc- 
tions that must be added. Substituting these equivalents into the 
expression 

SC4' - A') (x - xO . 

; — = K{, as denned on page 108, gives 

2 (a; — Xi) 2 

Ri _ s U* - (A? + AAQ] \x - « + Ax;) ] 
2[*-( X ;+Ax 1 )] 2 

_ 2{A> - Aj) (a; - x|) - Ax^ 2_Q4« - Aj) - AA f 2 (a; - x^ - Ax[) 
2(a; - xi) 2 - 2Ax^ 2 (a: - %[) + 2(Ax^) 2 

^ ZU< - AjHg - x,) - ^(AAj) (Ax t ) m 

2(* - <) 2 - JV X (A X ;) 2 KBO) 

since 

2U* - Aj) = 2AA! = NiAA{ ) 

2 (x - x) = 2Ax[ = iViAx; (87) 

2(a; - x* - A$ = 2(x - x x ) = J 




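In the same way corrections for $R_2^{i}$, $R_3^{i}$, etc., are obtained. Corrections
for M, P, and K are found as follows:

\[ M_{14} = \Sigma (x - x_1)_4 = \Sigma (x - x_1' - \Delta x_1')_4 = \Sigma (x - x_1')_4 - n_{14}\,\Delta x_1' \tag{88} \]
and similarly for $M_{15}$, $M_{16}$, etc.

\[ P_{41} = \Sigma (y - y_4)_1 = \Sigma (y - y_4' - \Delta y_4')_1 = \Sigma (y - y_4')_1 - n_{14}\,\Delta y_4' \tag{89} \]
and similarly for $P_{42}$, $P_{43}$, etc.

\[ K_{14} = \Sigma (x - x_1)_4 (y - y_4)_1 = \Sigma (x - x_1' - \Delta x_1')_4 (y - y_4' - \Delta y_4')_1 \]
\[ \qquad = \Sigma (x - x_1')_4 (y - y_4')_1 - \Delta y_4'\,\Sigma (x - x_1')_4 - \Delta x_1'\,\Sigma (y - y_4')_1 + n_{14}\,(\Delta x_1')(\Delta y_4') \tag{90} \]
and similarly for $K_{15}$, $K_{16}$, etc.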

Applying the foregoing corrections to the numerical values, and
substituting these corrected values in equations (29) to (40), gives

\[ A = 12.3778 + \tfrac{1}{9}(2d + 5f) - \tfrac{1}{9}\{2.6644R_4 + 0.8000R_5 - 0.4000R_6\} \tag{91} \]

\[ B = 12.4667 + \tfrac{1}{9}(2d + 3f) - \tfrac{1}{9}\{-0.2356R_4 + 0.5000R_5 + 1.4000R_6\} \tag{92} \]

\[ C = 8.9333 + \tfrac{1}{9}(5d + 1f) - \tfrac{1}{9}\{-2.3890R_4 - 1.3000R_5 - 1.0000R_6\} \tag{93} \]

\[ D = 9.6889 + \tfrac{1}{9}(2a + 5c) - \tfrac{1}{9}\{-0.6290R_1 - 0.7244R_2 + 5.6335R_3\} \tag{94} \]

\[ E = 11.6778 + \tfrac{1}{9}(2a + 3c) - \tfrac{1}{9}\{1.9710R_1 - 0.9488R_2 - 3.9199R_3\} \tag{95} \]

\[ F = 12.4111 + \tfrac{1}{9}(5a + 1c) - \tfrac{1}{9}\{-1.4225R_1 + 1.8134R_2 - 1.7733R_3\} \tag{96} \]

\[ R_1 = 0.2763 + \frac{-0.6290d - 1.4225f + 0.8463R_4 - 0.9956R_5 - 1.9522R_6}{13.2822} \tag{97} \]

\[ R_2 = -0.8264 + \frac{-0.7244d + 1.8134f - 0.1723R_4 - 2.0639R_5 + 3.3689R_6}{7.8556} \tag{98} \]

\[ R_3 = -1.2081 + \frac{5.6335d - 1.7733f + 3.3324R_4 - 0.5667R_5 - 1.7667R_6}{33.1200} \tag{99} \]

\[ R_4 = 2.0965 + \frac{2.6644a - 2.3890c + 0.8463R_1 - 0.1723R_2 + 3.3324R_3}{5.9156} \tag{100} \]

\[ R_5 = 0.1687 + \frac{0.8000a - 1.3000c - 0.9956R_1 - 2.0639R_2 - 0.5667R_3}{\ldots} \tag{101} \]

\[ R_6 = 1.0161 + \frac{-0.4000a - 1.0000c - 1.9522R_1 + 3.3324R_2 - 1.7667R_3}{10.5400} \tag{102} \]

These equations are solved by successive approximation exactly as
indicated in table 4. The final values so obtained are

\[ A = 12.38 - 0.31 = 12.07 \tag{103} \]

\[ B = 12.47 - 0.29 = 12.18 \tag{104} \]

\[ C = 8.93 - 0.12 = 8.81 \tag{105} \]

\[ D = 9.69 + 2.64 = 12.33 \tag{106} \]

\[ E = 11.68 + 0.50 = 12.18 \tag{107} \]

\[ F = 12.41 + 0.34 = 12.75 \tag{108} \]

\[ R_1 = 0.28 + 0.03 = 0.31 \tag{109} \]

\[ R_2 = -0.83 + 0.36 = -0.47 \tag{110} \]

\[ R_3 = -1.21 + 0.00 = -1.21 \tag{111} \]

\[ R_4 = 2.10 - 2.09 = 0.01 \tag{112} \]

\[ R_5 = 0.17 - 1.00 = -0.83 \tag{113} \]

\[ R_6 = 1.02 - 0.33 = 0.69 \tag{114} \]

Utilizing these values in the most expedient way, the functional
relations are conveniently expressed as

\[ w = f_1(x) + f_2(y) = 12.18 + F_1(\bar{x}) + F_2(\bar{y}) + R_x(x - \bar{x}) + R_y(y - \bar{y}) \tag{115} \]

where $\bar{x}$ and $\bar{y}$ signify any one of the three group averages of x and y;
where $R_x$ represents any one of the three regression coefficients $R_1$, $R_2$,
and $R_3$, and $R_y$ any one of $R_4$, $R_5$, and $R_6$; and where $F_1(\bar{x})$ and $F_2(\bar{y})$
are defined in table 8.

TABLE 8.
Definition of functional relations.

                      x                                                      y
   mean          limits        F₁(x̄)         R_x            mean         limits       F₂(ȳ)        R_y
   x₁ = 62.95    60.4 - 64.2   -0.11 = -a    R₁ =  0.31     y₄ = 4.32    3.5 -  6.0   0.15 = -d    R₄ =  0.01
   x₂ = 65.27    64.2 - 66.9    0            R₂ = -0.47     y₅ = 6.90    6.0 -  8.1   0            R₅ = -0.83
   x₃ = 69.27    67.0 - 73.4   -3.37 = -c    R₃ = -1.21     y₆ = 9.10    8.1 - 11.6   0.57 = -f    R₆ =  0.69



The value of w corresponding to x = 73.4 and y = 3.6 may be
computed, as in the case just considered, by correcting first for the
group, which, from equation (115) and table 8, is

\[ w = 12.18 - 3.37 + 0.15 = 8.96 \tag{116} \]

which corresponds to $x = x_3 = 69.27$ and $y = y_4 = 4.32$. Introducing
the regression coefficients to correct for the position within
the group, $R_3(73.4 - x_3)$ and $R_4(3.6 - y_4)$ must be added; whence

\[ w = 8.96 - 5.00 - 0.01 = 3.95 \tag{117} \]



which agrees still better 1 with the observed value of 4.0 than that 
given by equation (83). 
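
Equation (115) lends itself to direct evaluation. The following minimal Python sketch uses the entries of table 8; the function names `find_group` and `estimate` are assumptions introduced here for illustration, and an entry lying exactly on a shared limit (such as 64.2) is assigned to the lower group, a choice the text does not prescribe.

```python
STANDARD = 12.18  # corrected average B = E when regressions are used

# (group mean, lower limit, upper limit, F value, regression coefficient), from table 8
X_GROUPS = [(62.95, 60.4, 64.2, -0.11, 0.31),
            (65.27, 64.2, 66.9, 0.0, -0.47),
            (69.27, 67.0, 73.4, -3.37, -1.21)]
Y_GROUPS = [(4.32, 3.5, 6.0, 0.15, 0.01),
            (6.90, 6.0, 8.1, 0.0, -0.83),
            (9.10, 8.1, 11.6, 0.57, 0.69)]

def find_group(value, groups):
    """Return the first group whose limits contain value."""
    for group in groups:
        if group[1] <= value <= group[2]:
            return group
    raise ValueError(f"{value} lies outside the tabulated limits")

def estimate(x, y):
    """Equation (115): group value plus regression corrections within the groups."""
    x_mean, _, _, f1, r_x = find_group(x, X_GROUPS)
    y_mean, _, _, f2, r_y = find_group(y, Y_GROUPS)
    group_value = STANDARD + f1 + f2                       # equation (116)
    w = group_value + r_x * (x - x_mean) + r_y * (y - y_mean)  # equation (117)
    return group_value, w

group_value, w = estimate(73.4, 3.6)
print(f"group value {group_value:.2f}, corrected estimate {w:.2f}")
```

The small departure from the printed 3.95 arises only because the text rounds each correction to two decimals before adding.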

Solution by the slope method is effected, after introducing the corrections
indicated by equations (88) and (89), by replacing equations
(35) to (40) with equations (41) to (48) and substituting $S_1$, $S_2$, $S_3$, $S_4$,
$S_5$, and $S_6$ for $R_1$, $R_2$, $R_3$, $R_4$, $R_5$, and $R_6$, in equations (29) to (34).

\[ A = 12.38 + \tfrac{1}{9}(2d + 5f) - \tfrac{1}{9}\{2.66(0.597d + 0.246f) + 0.8(0.179d - 0.246f) - 0.4(-0.179d - 0.664f)\} \tag{118} \]



1 . . . This agreement between computed and observed values is, of 
course, accidentally close. Considering the observed series as a whole, the 
greatest deviation between computed and observed values is 7.5 by Blair's 
method, 6.3 by the first method of successive approximation, and 5.3 by the 
second method; while the "probable error" of a single computed value in 
each case lies between 1 and 2 (see page 132). 




\[ B = 12.47 + \tfrac{1}{9}(2d + 3f) - \tfrac{1}{9}\{-0.236(0.597d + 0.246f) + 0.5(0.179d - 0.246f) + 1.4(-0.179d - 0.664f)\} \tag{119} \]

\[ C = 8.93 + \tfrac{1}{9}(5d + 1f) - \tfrac{1}{9}\{-2.389(0.597d + 0.246f) - 1.3(0.179d - 0.246f) - 1.0(-0.179d - 0.664f)\} \tag{120} \]

\[ D = 9.69 + \tfrac{1}{9}(2a + 5c) - \tfrac{1}{9}\{-0.629(0.590a + 0.092c) - 0.724(0.273a - 0.092c) + 5.633(-0.273a - 0.408c)\} \tag{121} \]

\[ E = 11.68 + \tfrac{1}{9}(2a + 3c) - \tfrac{1}{9}\{1.971(0.590a + 0.092c) - 0.949(0.273a - 0.092c) - 3.920(-0.273a - 0.408c)\} \tag{122} \]

\[ F = 12.41 + \tfrac{1}{9}(5a + 1c) - \tfrac{1}{9}\{-1.422(0.590a + 0.092c) + 1.813(0.273a - 0.092c) - 1.773(-0.273a - 0.408c)\} \tag{123} \]

These equations reduce to

\[ A = 12.38 + 0.0218d + 0.4750f \tag{124} \]

\[ B = 12.47 + 0.2560d + 0.4567f \tag{125} \]

\[ C = 8.93 + 0.720d + 0.0671f \tag{126} \]

\[ D = 9.69 + 0.4562a + 0.8100c \tag{127} \]

\[ E = 11.68 + 0.0029a + 0.1258c \tag{128} \]

\[ F = 12.41 + 0.5402a + 0.0637c \tag{129} \]

which are of the same form as equations (50) to (55). The solution
gives A = 12.13, B = 12.13, C = 8.60, D = 12.55, E = 12.13, and
F = 12.64; and, from equations (41) to (48), $S_1 = 0.32$, $S_2 = -0.32$,
$S_3 = -1.44$, $S_4 = -0.376$, $S_5 = 0.05$, and $S_6 = 0.413$.
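
Since equations (124) to (129) have the same form as (50) to (55), the same scheme of repeated substitution applies. The following is a minimal Python sketch under that assumption; the dictionary layout, the function name `solve_slope` and the stopping tolerance are introduced here for illustration only. Because the printed coefficients are rounded, the sketch agrees with the printed solution only to within a few hundredths.

```python
# First approximations (original group averages).
FIRST = {"A": 12.38, "B": 12.47, "C": 8.93, "D": 9.69, "E": 11.68, "F": 12.41}

# Coefficient pairs of equations (124)-(129): (d, f) for A, B, C and (a, c) for D, E, F.
COEF = {"A": (0.0218, 0.4750), "B": (0.2560, 0.4567), "C": (0.720, 0.0671),
        "D": (0.4562, 0.8100), "E": (0.0029, 0.1258), "F": (0.5402, 0.0637)}

def solve_slope(tol=1e-10, max_iter=200):
    values = dict(FIRST)
    for _ in range(max_iter):
        d = values["E"] - values["D"]
        f = values["E"] - values["F"]
        new = {}
        for key in "ABC":                       # equations (124)-(126)
            new[key] = FIRST[key] + COEF[key][0] * d + COEF[key][1] * f
        a = new["B"] - new["A"]
        c = new["B"] - new["C"]
        for key in "DEF":                       # equations (127)-(129)
            new[key] = FIRST[key] + COEF[key][0] * a + COEF[key][1] * c
        change = max(abs(new[key] - values[key]) for key in FIRST)
        values = new
        if change < tol:
            break
    return values

solution = solve_slope()
for key in "ABCDEF":
    print(f"{key} = {solution[key]:.2f}")
```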

Using these relations to compute the value of w corresponding to 
x = 73.4 and y = 3.6 gives 

w = 12.13 - 3.53 + 0.42 - 5.95 + 0.27 = 3.34 (130) 

a fair approximation to the value 3.95 given when regressions in each 
group were used. 






5. Supplementary Considerations. 

For clearness in presenting the central idea of this method of suc- 
cessive approximation, and for convenience in the analytic demonstra- 
tions and concrete illustrations, certain matters were barely men- 
tioned that require further consideration. They are: first, manner of 
grouping; second, basis of mathematical reasoning; third, reliability 
of results; and fourth, meaning of end results. 
Manner of grouping. 

It is desirable so to group the values of each independent variable as 
to equally distribute the effect of "chance" on the corresponding 
average of the dependent variable. Among other things, this effect 
of chance increases in any particular group as the number of entries 
in that group decreases, whence, in order to distribute it equally, each 
group should contain approximately the same number of entries. But, 
this effect of chance in any particular group also depends upon the 
range in value of the independent variable within that group, and, for 
this reason, the entries should be so grouped as to make this range 
the same in each group. In practice, however, it is usually impossible 
to group the data so as to realize, even approximately, both of these 
conditions. Thus, in the illustrative wheat problem (see table 5), 
although the first condition is completely realized by dividing the 
twenty-seven entries into three groups of nine entries each with respect 
to both temperature and precipitation, the second condition is not, 
the temperature range being 3.8° F. in group 1, 2.7° F. in group 2, and 
6.4° F. in group 3, and the precipitation range being 2.5 inches in 
group 4, 2.1 inches in group 5, and 3.5 inches in group 6. When 
regressions are used it is probably best to meet the first requirement 
at the expense of the second, as is here done, but when variability 
within the group is neglected, this is not so evident. No general rule 
can be stated, and choice of the number of groups and number of 
entries in each is a matter of judgment, just as in fitting a parabolic 
function 

\[ w = a_0 + a_1 x + a_2 x^2 + \ldots + b_1 y + b_2 y^2 + \ldots \]

to empirical data, the number of terms to retain is a matter of judg- 
ment. In either case the final result is fixed as soon as the choice is 
made. 
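
The two grouping conditions can be examined numerically. The sketch below, in Python, takes the group limits of table 7 and reproduces the ranges quoted above; the helper `equal_count_groups` is a hypothetical illustration of the first condition (equal numbers of entries) and is applied here to placeholder values, since table 5 is not reproduced.

```python
# Group limits from table 7.
TEMP_LIMITS = [(60.4, 64.2), (64.2, 66.9), (67.0, 73.4)]   # degrees F, groups 1-3
PRECIP_LIMITS = [(3.5, 6.0), (6.0, 8.1), (8.1, 11.6)]      # inches, groups 4-6

def group_ranges(limits):
    """Range of the independent variable within each group, to one decimal."""
    return [round(high - low, 1) for low, high in limits]

def equal_count_groups(values, n_groups):
    """Split the sorted values into n_groups consecutive groups of equal size."""
    ordered = sorted(values)
    size, remainder = divmod(len(ordered), n_groups)
    if remainder:
        raise ValueError("number of entries is not divisible by number of groups")
    return [ordered[k * size:(k + 1) * size] for k in range(n_groups)]

print(group_ranges(TEMP_LIMITS))     # temperature ranges in groups 1, 2, 3
print(group_ranges(PRECIP_LIMITS))   # precipitation ranges in groups 4, 5, 6

# Placeholder data: 27 arbitrary values standing in for the entries of table 5.
groups = equal_count_groups(range(27), 3)
print([len(g) for g in groups])
```

The unequal ranges make plain that, with nine entries in every group, the second condition is not met by the wheat data.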

After deciding upon the number of groups and number of entries 
in each, it is not uncommon to find that the last value of the independent
variable in one group equals the first value in the succeeding
group. This is the case, for example, in the illustrative wheat prob- 
lem where the highest temperature in group 1 (see table 5) is 64.2° F. 
which is also the lowest temperature in group 2. In such cases it 
might seem better to vary the number of entries in the groups so as 
to bring the point of division between different values of the indepen- 
dent variable, e. g., in the particular case cited, to increase the number 
of entries in group 1 and decrease that in group 2 by 1 so that the 
highest temperature in group 1 would be 64.2° F. and the lowest in 
group 2 would be 64.4° F. However, this is only an apparent advan- 
tage, for, in the case when variability within the group is neglected 
the error introduced depends primarily upon the range in value of the 
independent variable within the group and not upon the precise point 
of division. The latter is comparatively insignificant, and the little 
effect it does have in the end results decreases as the number of entries 
per group increases and as the range in value of the independent vari- 
able decreases. Finally, in the case when regressions are used the 
effect of a change in the point of division is almost if not quite elimi- 
nated. 
Basis of mathematical reasoning. 

As stated on page 100, the special form of expression, upon which 
the mathematical demonstration of the method of successive approxi- 
mation is based, implies that the change in the dependent variable, w, 
corresponding to a given change in any independent variable is 
negligibly affected by the magnitude of the constant values to which 
the remaining independent variables are reduced. Stated in the 
concrete terms of the wheat problem, the assumption is that a change 
of say two degrees in temperature increases the wheat yield by essen- 
tially the same amount whether the precipitation is 3.5 or 11.6 inches. 
It is evident that this involves a second assumption; that the change 
in w is approximately independent of the magnitude of w, or, expressed 
in terms of the wheat problem, that an increase in temperature, say 
from 62.9° F. to 65.3° F., increases the wheat yield by approximately 
0.11 bushels per acre (see table 7), irrespective of whether the yield 
corresponding to 62.9° F. is 4 or 17 bushels per acre. Although both 
of these assumptions are likewise inherent in the method of multiple 
correlation, which, as a matter of fact, is but a special case of this 
more general method of successive approximation, the question at 
once arises: to what extent are these assumptions valid, and how 
may one proceed in any particular case when they are known not to 
be valid? 




This is not an easy question to answer. However, the validity of 
the second assumption obviously depends upon distribution of the 
values of the dependent variable. If they are distributed in approxi- 
mate accordance with the Gaussian, or normal law of error, the second 
assumption may be regarded as true, while, if their distribution devi- 
ates widely from this law, it is clear that this assumption is not valid. 
In such cases it is desirable to find some function of the dependent 
variable that is distributed in accordance with the Gaussian law, and 
deal with that function. For example, in many, if not most, biologi- 
cal problems, the change in w is not independent of the magnitude of w, 
but approximately proportional to it, and the logarithm of w, rather 
than w itself, is distributed in accordance with the Gaussian law. 
Accordingly, the proper form of expression, upon which to base
the mathematical reasoning, becomes $(w + k) = f_1(x) \times f_2(y) \times f_3(z)$,
etc., where, for greater generality, the constant k is introduced. But,
this expression may be written

\[ \log(w + k) = \log f_1(x) + \log f_2(y) + \log f_3(z) + \ldots \tag{131} \]

and putting $W = \log(w + k)$, $F_1 = \log f_1$, $F_2 = \log f_2$, etc., gives

\[ W = F_1(x) + F_2(y) + F_3(z) + \ldots \tag{132} \]

which is of the same form as that given by equation (1). In general, 
then, the nature of the frequency distribution of the dependent vari- 
able, affords an empirical criterion for determining the validity of the 
second assumption, and for suggesting what function of w to use to 
make this assumption valid. 
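
The effect of the logarithmic substitution is easily verified. In the minimal Python sketch below the factor values and the constant k are purely hypothetical numbers chosen for illustration; the function name `to_additive` is likewise an assumption and not part of the original method.

```python
import math

def to_additive(w, k=0.0):
    """W = log(w + k), the transformed dependent variable of equation (132)."""
    return math.log(w + k)

# Hypothetical values of f1(x), f2(y), f3(z) for a single observation.
factors = [2.0, 1.5, 0.8]
k = 0.5

# Construct w so that (w + k) equals the product of the factors.
w = math.prod(factors) - k

W = to_additive(w, k)                     # left side of equation (131)
F = [math.log(v) for v in factors]        # F1, F2, F3

print(W, sum(F))                          # the two agree: the product has become a sum
```

Once W is formed in this way, the method of successive approximation is applied to W exactly as it was applied to w.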

But this does not wholly eliminate the difficulty. For although 
the first assumption is, as a rule, valid whenever the second one is, 
this is not necessarily the case. Furthermore, there seems to be no 
criterion in the data themselves for determining this. However, in 
any case, after the "standard" values are selected, the ascertained 
change in the dependent variable corresponding to a given change in 
any particular independent one, is the change that would take place 
under average conditions of the remaining independent variables. 
It is obvious, then, that, when the range in value of the independent 
variables is small, the error introduced by this assumption is corre- 
spondingly small. Therefore, by classifying the data so as to restrict 
the range of those independent variables with respect to which the 
assumption does not accord well with fact, and applying the method 
of successive approximation separately to each portion of the data
thus segregated, the nature and magnitude of the inherent error can
be found, and approximately eliminated. 

Finally, the method of group averages involves no assumption as to 
the nature of the correlation between the independent variables. 
Consequently, the ascertained relation, say of w to x is not influenced 
by the way x may be correlated with y, z, etc., in the data used, and 
values of w computed on the basis of these ascertained relations are 
quite as reliable for any one particular combination of values of the 
independent variables as for any other combination within the ranges 
covered by the data. On the other hand, any method based upon 
assumed forms of functional relations is likely to lead to erroneous 
conclusions as to the relative importance of the various independent 
variables unless the assumed forms fit the data well. Values of w 
computed from results based upon unsuitable functional forms may 
agree well with those observed when the relation between the inde- 
pendent variables accords well with that occurring on the average 
in the data from which the results were obtained, but computed and 
observed values disagree widely when this relation differs materially 
from that prevailing in the original data. In other words, the error 
inherent in the initial assumption is so distributed among the various 
independent variables as to give the best possible fit to the data as a 
whole for the functions used, and this fit may seem accurate and still 
be highly artificial. 

For example, application of the group method to the wheat problem 
(see tables 7 and 8) shows clearly that the relation of yield to tempera- 
ture and precipitation was not linear, and indicates that temperature 
was a much more important factor than precipitation. Consequently, 
if those values of the yield corresponding to a narrow temperature 
range and a wide range of precipitation be computed from tempera- 
ture data alone the error should be nearly the same as when both 
temperature and precipitation data are used. In eleven cases the 
temperature lies between 63.7° F. and 65.0° F., while the precipitation 
varies from 3.7 to 11.6 inches. The standard deviation of the differ- 
ences between observed values of the yield and those computed from 
temperature data alone is 2.03 for the multiple linear correlation 
method, and 1.96 for the slope method. Introducing the correction 
for precipitation decreases the standard deviation in the first case 
by only 0.01 and increases it in the second case by only 0.01, thus 
agreeing with expectation that no significant change would result. 
But the standard deviation of the differences between observed and 
computed yields for those data having a large temperature range and 



RELATION OF VARIABLES. 131 

a small precipitation range should be materially less when based upon 
both temperature and precipitation, than when based upon the latter 
alone. Moreover, reduction of the standard deviation effected by 
introducing the correction for temperature should be materially greater 
for the slope method than for that of multiple linear correlation. In 
eight instances the temperature ranges from 62.6° F. to 73.4° F., while 
the precipitation varies only from 3.5 to 5.3 inches. The standard 
deviation is reduced by 2.85, or from 4.94 to 2.09, for the slope method 
by introducing the correction for temperature, while for the method 
of multiple linear correlation, the standard deviation is reduced by 
only 0.96, or from 3.53 to 2.57. These results are not only in agreement 
with expectation but clearly indicate that part of the actual effect 
of temperature on yield is attributed by the method of multiple 
linear correlation to precipitation. To further test the truth of 
this indication, each value of the wheat yield given in table 5 was 
computed on the basis of the temperature relation ascertained by the 
slope method, and a linear correlation run between the residuals and 
precipitation. As a result the precipitation-yield correlation coeffi- 
cient is reduced from + 0.22 to + 0.015. 
Reliability of results. 

Experience proves that, after all practicable efforts are made toward 
controlling or correcting variations attributable to the independent 
variables, some deviation between computed and observed values of 
the dependent variable always remains. Although the magnitude 
of these deviations is an index of the accuracy with which the empirical 
relations describe the data at hand, one usually needs an estimate 
of the reliability with which these empirical relations describe the 
whole "universe" of which those data are a sample. Accordingly, the
complexity of these relations must be taken into account. This need 
is particularly evident in the method of group averages, for, while 
the empirical relations describe the data at hand with increasing pre- 
cision as the number of groups approaches the number of observations, 
reliability of this description decreases until, in the limiting case, it 
is no greater than that afforded by the isolated observations them- 
selves. It is evident, then, that in estimating reliability, the actual 
number of observations should be reduced by an amount depending 
on the total number of groups chosen to express the empirical relation, 
and, for this purpose, each group in which a regression is used is equiva- 
lent to two groups. An approximate rule is to add the number of 
independent variables less 1 to the total number of observations and 
to subtract from this sum the number of groups plus the number of
regression coefficients. A similar procedure is followed in the method
of least squares when, in determining the "probable error" of a value
computed by a formula having m unknown coefficients, the number
of observations is decreased by m.

Applying this rule to determine the reliability of the estimated 
yield of wheat, the reduced number of observations is 27 — 3 = 24 
for the multiple linear correlation method; 27 — 5 = 22 for the group 
method used when variability within the group is neglected, and also 
for the slope method; and 27 — 11 = 16 for the group method in 
which regressions are used. The reliability for each of the four 
methods, given by the probable error of a single observation, is 



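\[ 0.6745\sqrt{\frac{\ldots}{24}} = \pm 1.69, \qquad 0.6745\sqrt{\frac{\ldots}{22}} = \pm 1.70, \]

\[ 0.6745\sqrt{\frac{\ldots}{22}} = \pm 1.63; \quad \text{and} \quad 0.6745\sqrt{\frac{\ldots}{16}} = \pm 1.87 \]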
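
The counting rule and the probable-error formula may be sketched in Python as follows. The function names are assumptions introduced here, the formula is the usual least-squares form 0.6745 times the square root of the sum of squared deviations over the reduced number of observations, and the residuals in the example are hypothetical, since the individual deviations are not reproduced in the text.

```python
import math

def reduced_observations(n_obs, n_vars, n_groups, n_regressions=0):
    """Approximate rule of the text: add (independent variables - 1) to the
    observations, then subtract the groups plus the regression coefficients."""
    return n_obs + (n_vars - 1) - (n_groups + n_regressions)

def probable_error(residuals, n_reduced):
    """Probable error of a single observation from the deviations v."""
    return 0.6745 * math.sqrt(sum(v * v for v in residuals) / n_reduced)

# Wheat problem: 27 observations, 2 independent variables, 6 groups.
print(reduced_observations(27, 2, 6))        # group averages, variability neglected
print(reduced_observations(27, 2, 6, 6))     # group averages with 6 regressions

# Hypothetical deviations, for illustration only.
example = [1.0, -1.0, 2.0, -2.0]
print(probable_error(example, 4))
```

For the multiple linear correlation method the text instead subtracts the three coefficients of the linear formula directly, giving 27 - 3 = 24.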

In case the results given by any one of the three group methods are 
"smoothed" by any process, a simpler relation is obtained which 
justifies a corresponding increase in the number of observations on 
which to base estimates of reliability. For example, in the case when 
regressions are used, the functional relation is given independently 
in two ways; first, by the series of corresponding averages of de- 
pendent and independent variable; and, second, by the regression 
coefficients and corresponding values of the independent variable. 
Accordingly, if the functional relations defined in these two ways be 
averaged, subtraction of the number of regression coefficients from 
the total number of observations would not be required to obtain the 
reduced number for estimating reliability, and the probable error of 
±1.87 would be correspondingly reduced. The slope method amounts
to averaging these relations at the outset. 
Meaning of end results. 

In order to visualize the results obtained by this method of suc- 
cessive approximation, each functional relation may be represented 
graphically. In the case when regressions are computed by the usual 
method, or estimated from the averages, the functional relation of say 
w to x is approximately represented by a series of straight lines, having 
slopes equal to the regression coefficients and drawn through points 
whose respective coordinates are $w_1$, $x_1$; $w_2$, $x_2$, etc. As the number
of observations and groups increases each point tends to coincide
with a point on the curve corresponding to the true functional rela- 
tion, and each line tends to coincide with the tangent at that point. 
In case regressions are not used, the result is simply a series of points 
determining the curve. 

Lastly, as is the case in applying any statistical method, the end 
result gives the relation of dependent variable not to each measured 
independent one, as such, but to a correlated system of unobserved 
variables which each measured one represents. In other words each 
observed variable is an index of a certain complex consisting of a 
number of unknown elements, and a statistical method merely serves 
to eliminate the effects of all but one of these complexes at a time and, 
by synthesis, to determine the effect of any combination of them. 
The method of successive approximation offers no exception to this 
rule, and, in order to further disentangle these complexes, additional 
variables must be measured and the number of observations increased. 



6. Literature Cited. 

Blair, T. A.

1913. Rainfall and spring wheat. Mo. Weather Rev., 41: 1515-1517.
1915. Temperature and spring wheat in the Dakotas. Mo. Weather Rev., 43: 24-26.
1918. Partial correlation applied to Dakota data on weather and wheat yield. Mo. Weather Rev., 46: 71-73.

Bôcher, M.

1909. An introduction to the study of Integral Equations. Cambridge Tracts in Mathematics and Mathematical Physics, No. 10, 71 pp.

Fredholm, I.

1900. Sur une nouvelle méthode pour la résolution du problème de Dirichlet. Öfv. Kongl. Vet. Ak. Förh. Stockholm, 57: 39-46.
1903. Sur une classe d'équations fonctionnelles. Acta Math., 27: 365-390.



